Big Data with Component Based Software
Who am I Erik who? Erik Forsberg <forsberg@opera.com> Linköping University, 1998-2003. Computer Science programme + lots of time at Lysator ACS. At Opera Software since 2008
Outline Background The problem Components for Processing of Big Data Hadoop Components for Coordination - ZooKeeper Components for Storing of Big Data - Cassandra Gluing it together
Background - What is Opera Software? 1200+ employees 13 locations Norway, Sweden (Linköping [70+ people], Gothenburg, Stockholm), Poland, USA, Japan, China, South Korea, Taiwan, Russia, Ukraine, Iceland, Singapore We make browsers, deliver content, serve ads, compress video, help operators sell data plans, etc. And use a lot of computers and bandwidth :-)
What am I doing @ Opera Software? I'm the Grumpy Guy in the corner Architect, Opera Statistics Platform Working with around 10 other people in Poland, India, USA Dealing with Hadoop, ZooKeeper, Cassandra, Python, Java, Nagios, Zabbix, etc.
Background
Background What's our problem? Opera has a multitude of services producing data Opera Mini Opera Coast Opera for Android Opera Mediaworks Opera Discover Etc. Basically every Opera service / product produces some kind of data
Background contd. Some of these produce huge amounts of data Opera Mini produces 5.5 TiB of data daily for some 250 million monthly users
Background What's our problem? We want statistics on.. a lot of things Usage by country? Usage by version? Usage by country and version? Pageviews by country and version? Unique users by country and version? Per day, week, month, quarter and year? Top domains used? Top domains per country? How do we make users stay with our products? Etc..
Problem #1: Big Data Don't try to handle 5.5 TiB of daily data on your laptop. Don't try to store the results in a MySQL database and do JOINs; it doesn't work well when you're adding 200+ million combinations every day. Internet-scale data
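To see why the result set explodes, here is a back-of-the-envelope sketch. The dimension sizes below are hypothetical, picked only to illustrate the cross-product effect; they are not Opera's real numbers.

```python
# Hypothetical dimension sizes, for illustration only (not Opera's real numbers).
countries = 200
versions = 50
metrics = ["pageviews", "unique_users", "data_usage"]
periods = ["day", "week", "month", "quarter", "year"]

# Each question ("pageviews by country and version, per day", ...) is one
# cell in the cross product of the dimensions, so the output multiplies fast:
combinations = countries * versions * len(metrics) * len(periods)
print(combinations)  # 150000 -- and that is before adding domains, devices, ...
```

Add a few more dimensions (top domains, device models, operators) and the daily output quickly reaches the hundreds of millions of combinations mentioned above.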
Problem #2: Scalability Our services are growing, and so is the list of required features Must be able to cope with growth by adding more machines, not by replacing one machine with another more expensive and powerful one.
Problem #3: Reliability / Fault tolerance Statistics are important and must be delivered in time, even if machines in data processing cluster go down
Opera Statistics Platform
Opera Statistics Platform Architectural overview
The General Idea How it all works 1. Collect data 2. Produce combinations of variables for fixed timeperiods in Apache Hadoop 3. Store in Apache Cassandra as key/values 4. Access data via Web UI talking to Cassandra
Input data Logs are produced by various clusters We install nginx on each machine. Logs are retrieved over https, with client certificate authentication. Components used: nginx
Getting the logs OSP Log Manager, written in-house. Pro tip: doing 400 concurrent https connections per file works well when communicating with servers in China :-) Currently using 2 Gbit/s during an approx. 1.5h period Logs are uploaded directly to HDFS Components: Twisted library, Python, ZooKeeper
Processing Big Data Apache Hadoop
Apache Hadoop Map/Reduce (M/R) framework Distributed fault-tolerant filesystem (HDFS) Scalable, just add more computers M/R jobs can be written in Java (native), C++ (pipes), or with any programming language (streaming)
Hadoop M/R example (in python)

class MRExample(object):
    def map(self, key, value):
        if key.startswith("i_want_this"):
            yield key, value

    def reduce(self, key, values):
        yield key, sum(values)

Input:
(i_want_this_1, 7)
(i_want_this_2, 8)
(dont_want_this, 10)
(i_want_this_1, 7)

Results:
(i_want_this_1, 14)
(i_want_this_2, 8)
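The map/reduce pair above can be exercised locally with a tiny driver that stands in for Hadoop's shuffle phase (grouping mapped values by key). This is only a sketch of the M/R model, not the Hadoop runtime:

```python
from collections import defaultdict

class MRExample(object):
    def map(self, key, value):
        if key.startswith("i_want_this"):
            yield key, value

    def reduce(self, key, values):
        yield key, sum(values)

def run_local(job, records):
    """Minimal in-process stand-in for map -> shuffle -> reduce."""
    grouped = defaultdict(list)
    for key, value in records:
        for k, v in job.map(key, value):   # map phase: filter/transform
            grouped[k].append(v)           # shuffle: group values by key
    results = {}
    for k, vs in sorted(grouped.items()):  # reduce phase: aggregate per key
        for rk, rv in job.reduce(k, vs):
            results[rk] = rv
    return results

records = [("i_want_this_1", 7), ("i_want_this_2", 8),
           ("dont_want_this", 10), ("i_want_this_1", 7)]
print(run_local(MRExample(), records))
# {'i_want_this_1': 14, 'i_want_this_2': 8}
```

On a real cluster, the map and reduce calls run in parallel across many machines, and the "grouped" dictionary becomes a distributed sort-and-shuffle over HDFS.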
OSP and Hadoop Long chain of M/R jobs Begins with a logextract job with a directory of logs as input Continues with jobs that merge timeperiods Then jobs that produce combinations from variables Ends with a job that bulkloads permutations to Cassandra
Current Hadoop Production Cluster Hardware, you cannot always avoid it 88 nodes Each with 16 cores, 64GiB RAM, 6TB disk In total 450TB of HDFS storage In Iceland, where power and cooling are environmentally friendly and cheap.
OSP Scheduler Keeps track of the chain of jobs One job's output serves as input to another job Multiple chains, executed per timeperiod Multiple machines, multiple worker processes per machine Uses Apache ZooKeeper for coordination, locks, and a distributed queue
OSP Scheduler Keeps track of our chain of events Reads a configuration file (.ini) When one job has completed, jobs depending on it will be started Tasks written in python Either runs on the scheduler node, or starts a Hadoop job Developed in-house. No good alternative available (when we started) Fault-tolerant and Scalable Components used: Python, ZooKeeper
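The "when one job has completed, jobs depending on it will be started" logic amounts to walking a dependency graph read from a config file. The .ini format and job names below are hypothetical, made up for illustration; OSP's real configuration format is not shown in the slides:

```python
import configparser

# Hypothetical config: each section is a job, "after" lists its dependencies.
INI = """
[logextract]
after =

[merge_timeperiods]
after = logextract

[combinations]
after = merge_timeperiods

[bulkload]
after = combinations
"""

def run_order(ini_text):
    """Return a job order where every job runs after its dependencies."""
    cfg = configparser.ConfigParser()
    cfg.read_string(ini_text)
    deps = {job: {d.strip() for d in cfg[job]["after"].split(",") if d.strip()}
            for job in cfg.sections()}
    order, done = [], set()
    while deps:
        # A job is ready once all of its dependencies have completed.
        ready = [job for job, d in deps.items() if d <= done]
        if not ready:
            raise ValueError("dependency cycle in job chain")
        for job in ready:
            order.append(job)
            done.add(job)
            del deps[job]
    return order

print(run_order(INI))
# ['logextract', 'merge_timeperiods', 'combinations', 'bulkload']
```

In the real scheduler, multiple worker processes on multiple machines pull ready jobs from a ZooKeeper-backed queue instead of running them sequentially in one process.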
Coordinating Distributed Systems Apache ZooKeeper
Apache ZooKeeper Because coordinating distributed systems is a Zoo Centralized service for distributed synchronization Really simple API, but can be used to build complex distributed services Provides a simple way of not having to worry about doing things like distributed locks right.
Apache ZooKeeper Overview Filesystem-like data structure Znodes Znodes can have data (up to 1MiB) Znodes can have children Atomic changes Versioning Ephemeral nodes Watches
Apache ZooKeeper Fault-tolerance and performance Needs at least 3 servers for redundancy With 3 servers, one server may go down and the service is still up With 5 servers, 2 servers may go down and the service is still up A coordinator is dynamically elected through quorum vote Write performance is limited by I/O performance on the coordinator node Read performance is good; reads are local on each server. All servers keep a full copy of the database. Clients connect to a randomly selected node; writes are forwarded to the coordinator node
How OSP uses ZooKeeper Distributed queue Multiple producers, multiple consumers Locks Either a znode is present, or it's not (atomicity) Want to run a cron job on only one node at a time: an ephemeral node with a known name Barriers Waiting until all tasks in a chain have completed before some action No polling; uses watches Configuration data Dynamic reconfig of a service by watching a known znode
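The "ephemeral node with known name" lock pattern can be shown with a toy in-memory model. This is not the real ZooKeeper client API (production code would use a library such as kazoo against a live ensemble); it only demonstrates why atomic znode creation is enough to build a lock:

```python
class FakeZooKeeper(object):
    """Toy in-memory stand-in for a ZooKeeper namespace, for illustration only."""
    def __init__(self):
        self.znodes = {}

    def create(self, path, data=b"", ephemeral=False):
        # ZooKeeper's create is atomic: it fails if the znode already exists,
        # so at most one client can ever hold a given lock path at a time.
        if path in self.znodes:
            raise KeyError("node already exists: " + path)
        self.znodes[path] = (data, ephemeral)

    def delete(self, path):
        del self.znodes[path]

def try_take_lock(zk, name):
    try:
        # Ephemeral: the real server deletes the node if our session dies,
        # so a crashed worker can never hold the lock forever.
        zk.create("/locks/" + name, ephemeral=True)
        return True
    except KeyError:
        return False

zk = FakeZooKeeper()
print(try_take_lock(zk, "cronjob"))  # True  -- first worker gets the lock
print(try_take_lock(zk, "cronjob"))  # False -- second worker backs off
zk.delete("/locks/cronjob")          # explicit release (or implicit, on session loss)
```

The barrier and queue recipes build on the same primitives, with watches replacing polling: a client asks to be notified when a znode appears, changes, or disappears.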
Storing Big Data Apache Cassandra
Apache Cassandra A distributed NoSQL solution Distributed key/value store, but with columns Scalable. Keys distributed among nodes. Just add more hardware. Handles large amounts of data Tunable consistency Configurable replication All nodes in cluster are equal. No master node.
NoSQL I was raised with RDBMS and MySQL; this is different No relations Key/Value But with columns. A key can have many, many columns Think Different How will data be accessed by the application? Not How should I store data in the most efficient way?
Cassandra Data Model Row keys with sparse columns Row keys are hashed and data is distributed among servers Keyspace replication factor (RF) decides number of servers that hold each key RF > X protects against X servers failing RF also decides read performance Columns are sorted, and can be retrieved in slices
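The "row keys are hashed and distributed, RF decides how many servers hold each key" idea can be sketched as a toy token ring. Real Cassandra partitioners and replication strategies differ in detail; the node names below are made up:

```python
import hashlib

# Hypothetical 4-node cluster, for illustration only.
NODES = ["node-a", "node-b", "node-c", "node-d"]

def token(row_key):
    """Hash the row key to a position on the ring (MD5, like early Cassandra)."""
    return int(hashlib.md5(row_key.encode()).hexdigest(), 16)

def replicas(row_key, rf=2):
    """Pick RF distinct nodes by walking clockwise from the key's position."""
    start = token(row_key) % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(rf)]

# The same key always lands on the same RF nodes, so with RF=2
# any single node can fail without losing the row.
print(replicas("forsberg", rf=2))
```

Raising RF buys failure tolerance and read throughput (more nodes can answer for each key) at the cost of extra storage and write fan-out.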
Cassandra Cluster [diagram: two nodes, each holding Keyspace1 with RF=2; Column Families X, Y, Z replicated across the nodes]
Cassandra data model Sparse columns

Row key   | Home address  | Work e-mail             | Home e-mail
forsberg  | The outskirts | forsberg@opera.com      |
carrie    | NY Downtown   | sexandthecity@gmail.com | carrie@example.com
ola       | Lambohov      | Ola.leifler@liu.se      |
Cassandra data model No JOIN, use a second Column Family

Row key                 | Username
forsberg@opera.com      | forsberg
sexandthecity@gmail.com | carrie
Ola.leifler@liu.se      | ola
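The two column families above are the classic denormalization trick: instead of a JOIN, every write updates both "tables" so that each read pattern becomes a single-key lookup. A minimal sketch with plain dicts standing in for the column families:

```python
# Toy model of the two column families: no JOIN, so writes go to both
# structures and each read pattern gets its own single-key lookup.
users_by_name = {}    # CF 1: row key = username
names_by_email = {}   # CF 2: row key = e-mail address (the "reverse index")

def add_user(username, email, address):
    users_by_name[username] = {"Work e-mail": email, "Home address": address}
    names_by_email[email] = {"Username": username}  # denormalized copy

add_user("forsberg", "forsberg@opera.com", "The outskirts")

# Both access paths are now O(1) key lookups -- no JOIN required:
print(names_by_email["forsberg@opera.com"]["Username"])  # forsberg
print(users_by_name["forsberg"]["Home address"])         # The outskirts
```

The price is that the application must keep the copies consistent itself; the win is that reads scale with the cluster instead of with JOIN complexity.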
Cassandra storage Sorted String Tables (sstables) Writes to Cassandra go to a commitlog, then to memory Memory is dumped to sstables periodically (or when it gets full) Very fast, scalable writes! Sparse columns are stored efficiently. No null values to store. Columns can have a TTL, which makes data go away automatically Sstables are merged automatically over time
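The write path above (memory table, flush to immutable sorted files, periodic merge that drops expired data) can be modelled in a few lines. This is a deliberately simplified toy, not Cassandra's actual storage engine; class and parameter names are made up:

```python
import time

class ToyLSMStore(object):
    """Very simplified sketch of the sstable write path, for illustration only."""
    def __init__(self, flush_at=3):
        self.memtable = {}
        self.sstables = []        # list of immutable, sorted (key, (value, expires)) lists
        self.flush_at = flush_at

    def put(self, key, value, ttl=None):
        # Writes only touch memory (plus, in the real system, a commitlog),
        # which is what makes them so fast.
        expires = time.time() + ttl if ttl else None
        self.memtable[key] = (value, expires)
        if len(self.memtable) >= self.flush_at:
            self.flush()

    def flush(self):
        """Dump the memtable to a new sorted, immutable sstable."""
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def compact(self):
        """Merge sstables, keeping the newest value and dropping expired TTLs."""
        merged, now = {}, time.time()
        for table in self.sstables:          # later (newer) sstables win on conflicts
            for key, (value, expires) in table:
                if expires is None or expires > now:
                    merged[key] = (value, expires)
        self.sstables = [sorted(merged.items())]

store = ToyLSMStore(flush_at=2)
store.put("country=sweden,shoesize=43", 7, ttl=3600)
store.put("country=norway,shoesize=42", 9)   # second write triggers a flush
store.compact()
print(len(store.sstables))  # 1 sstable left after compaction
```

In the real engine, compaction also merges duplicate versions of a key across sstables and reclaims disk; TTL'd columns simply never survive a compaction once expired, which is how OSP's retention periods work without explicit deletes.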
How OSP uses Cassandra Results of Hadoop calculations are stored in Cassandra e.g. country=sweden,shoesize=43 : 7 unique users for Opera Mini Millions of keys stored every day TTL used to make keys go away after our retention period (e.g. daily data is gone after 6 months)
Choosing the right tool for the job
Hadoop vs Cassandra Hadoop Really good streaming I/O Slow as a melting iceberg on random access, so not useful as a backend for a UI Has an M/R component Cassandra Really good response times on random access to any row key, which makes it useful as a backend for a UI No data crunching component Hadoop can crunch data from Cassandra efficiently
Cassandra vs ZooKeeper Two very different (kind of) key/value systems Cassandra Scalable write performance Can keep large blobs Structured data; keys and columns (can) have data types No atomicity, but tunable consistency ZooKeeper Write performance limited by the I/O of one node Small amounts of data (1 MiB per znode); data must fit in the heap Just a blob (we use JSON) Atomicity, ephemeral nodes, watches
How do you choose the right component? KISS Don't be afraid of Open Source, but evaluate how active the community is Evaluate multiple alternatives Choose what seems easiest to work with. If it takes 3 days to set it up for an experiment, it probably isn't right.
The UI
The UI I'm not a UI person, but we have other team members who are Written in PHP Uses Highcharts, a fine Norwegian component for drawing SVG graphs Also uses Memcache for some local caching
Tying it together
Tying it all together You need some kind of glue to build a system We chose Python Chances are that if you need to communicate with component X, Python already has a module for it Really quick to do development in. Get down to business at once! At Opera we use a multitude of languages
End Questions?
We are hiring! http://jobs.opera.com
2013 Opera Software ASA. All rights reserved.