Big Data with Component Based Software

Who am I? Erik who? Erik Forsberg <forsberg@opera.com>. Linköping University, 1998-2003: Computer Science programme, plus lots of time at Lysator ACS. At Opera Software since 2008.

Outline: Background. The problem. Components for processing Big Data - Hadoop. Components for coordination - ZooKeeper. Components for storing Big Data - Cassandra. Gluing it together.

Background - What is Opera Software? 1200+ employees, 13 locations: Norway, Sweden (Linköping [70+ people], Gothenburg, Stockholm), Poland, USA, Japan, China, South Korea, Taiwan, Russia, Ukraine, Iceland, Singapore. We make browsers, deliver content, serve ads, compress video, help operators sell data plans, etc. And we use a lot of computers and bandwidth :-)

What am I doing @ Opera Software? I'm the Grumpy Guy in the corner: Architect, Opera Statistics Platform. Working with around 10 other people in Poland, India and the USA. Dealing with Hadoop, ZooKeeper, Cassandra, Python, Java, Nagios, Zabbix, etc.

Background

Background What's our problem? Opera has a multitude of services producing data: Opera Mini, Opera Coast, Opera for Android, Opera Mediaworks, Opera Discover, etc. Basically every Opera service/product produces some kind of data.

Background contd. Some of these produce huge amounts of data: Opera Mini produces 5.5 TiB of data daily for some 250 million monthly users.

Background What's our problem? We want statistics on... a lot of things. Usage by country? Usage by version? Usage by country and version? Pageviews by country and version? Unique users by country and version? Per day, week, month, quarter and year? Top domains used? Top domains per country? How do we make users stay with our products? Etc.

Problem #1: Big Data. Don't try to handle 5.5 TiB of daily data using your laptop. Don't try to store the results in a MySQL database and do JOINs; it doesn't work too well when you're adding 200+ million combinations every day. This is Internet-scale data.

Problem #2: Scalability. Our services are growing, and so is the list of required features. We must be able to cope with growth by adding more machines, not by replacing one machine with a more expensive and powerful one.

Problem #3: Reliability / Fault tolerance. Statistics are important and must be delivered on time, even if machines in the data processing cluster go down.

Opera Statistics Platform

Opera Statistics Platform Architectural overview

The General Idea How it all works: 1. Collect data. 2. Produce combinations of variables for fixed time periods in Apache Hadoop. 3. Store in Apache Cassandra as key/values. 4. Access data via a Web UI talking to Cassandra.
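To make step 2 concrete, here is a minimal sketch of how "combinations of variables" can be produced from one parsed log record. The field names are made up for illustration; the real pipeline runs this at Hadoop scale:

    from itertools import combinations

    # One parsed log record; field names are hypothetical
    record = {"country": "sweden", "version": "7.5", "domain": "opera.com"}

    def emit_combinations(record):
        # Emit every subset of dimensions, so "by country", "by version",
        # "by country and version", etc. can all be counted in one pass
        fields = sorted(record.items())
        for n in range(1, len(fields) + 1):
            for combo in combinations(fields, n):
                yield ",".join("%s=%s" % (k, v) for k, v in combo)

    for key in emit_combinations(record):
        print(key)  # e.g. "country=sweden,version=7.5"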

Input data Logs are produced by various clusters. We install nginx on each machine. Logs are retrieved over HTTPS, with client certificate authentication. Components used: nginx.

Getting the logs OSP Log Manager, written in-house. Pro tip: doing 400 concurrent HTTPS connections, one per file, works well when communicating with servers in China :-) Currently using 2 Gbit/s during an approx. 1.5 h period. Logs are uploaded directly to HDFS. Components: Twisted library, Python, ZooKeeper.
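A minimal sketch of what concurrency-limited fetching could look like with Twisted; the URL, the 400-connection cap and the callback chain are assumptions based on this slide, and client certificate authentication and the HDFS upload are omitted:

    from twisted.internet import defer, task
    from twisted.web.client import Agent, readBody

    def fetch_all(reactor, urls):
        agent = Agent(reactor)
        sem = defer.DeferredSemaphore(400)  # cap concurrent connections

        def fetch(url):
            d = agent.request(b"GET", url)
            d.addCallback(readBody)  # real code would stream the body to HDFS
            return d

        return defer.gatherResults([sem.run(fetch, u) for u in urls])

    def main(reactor):
        urls = [b"https://logserver.example.com/access.log.gz"]  # hypothetical
        return fetch_all(reactor, urls)

    task.react(main)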

Processing Big Data Apache Hadoop

Apache Hadoop Map/Reduce (M/R) framework. Distributed fault-tolerant filesystem (HDFS). Scalable: just add more computers. M/R jobs can be written in Java (native), C++ (pipes), or any programming language (streaming).

Hadoop M/R example (in Python):

    class MRExample(object):
        def map(self, key, value):
            if key.startswith("i_want_this"):
                yield key, value

        def reduce(self, key, values):
            yield key, sum(values)

Input:

    ("i_want_this_1", 7)
    ("i_want_this_2", 8)
    ("dont_want_this", 10)
    ("i_want_this_1", 7)

Results:

    ("i_want_this_1", 14)
    ("i_want_this_2", 8)
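To see the data flow the framework provides, here is a small local simulation of the map, shuffle and reduce phases. This is not how Hadoop executes jobs, just the same logic in miniature:

    from itertools import groupby
    from operator import itemgetter

    job = MRExample()
    inputs = [("i_want_this_1", 7), ("i_want_this_2", 8),
              ("dont_want_this", 10), ("i_want_this_1", 7)]

    # Map phase
    mapped = [pair for k, v in inputs for pair in job.map(k, v)]
    # Shuffle phase: bring all values for a key together
    mapped.sort(key=itemgetter(0))
    # Reduce phase
    for key, group in groupby(mapped, key=itemgetter(0)):
        for result in job.reduce(key, [v for _, v in group]):
            print(result)  # ('i_want_this_1', 14) and ('i_want_this_2', 8)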

OSP and Hadoop Long chain of M/R jobs. Begins with a logextract job with a directory of logs as input. Continues with jobs that merge time periods, then jobs that produce combinations from variables. Ends with a job that bulk-loads the permutations into Cassandra.

Current Hadoop Production Cluster Hardware, you cannot always avoid it: 88 nodes, each with 16 cores, 64 GiB RAM and 6 TB disk. In total 450 TB of HDFS storage. In Iceland, where power and cooling are environmentally friendly and cheap.

OSP Scheduler Keeps track of the chain of jobs. One job's output serves as input to another job. Multiple chains, executed per time period. Multiple machines, multiple worker processes per machine. Uses Apache ZooKeeper for coordination, locks and a distributed queue.

OSP Scheduler Keeps track of our chain of events. Reads a configuration file (.ini). When one job has completed, jobs that depend on it are started (see the sketch below). Tasks are written in Python and either run on the scheduler node or start a Hadoop job. Developed in-house; no good alternative was available when we started. Fault-tolerant and scalable. Components used: Python, ZooKeeper.
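A toy version of the dependency-chain idea, assuming a hypothetical .ini layout; the real scheduler's configuration format and job names are not shown in the slides:

    import configparser

    config = configparser.ConfigParser()
    config.read_string("""
    [logextract]
    depends =
    [merge_timeperiods]
    depends = logextract
    [combinations]
    depends = merge_timeperiods
    [bulkload]
    depends = combinations
    """)

    def runnable_jobs(done):
        # Jobs not yet run whose dependencies have all completed
        for job in config.sections():
            deps = config[job]["depends"].split()
            if job not in done and all(d in done for d in deps):
                yield job

    done = set()
    while len(done) < len(config.sections()):
        for job in list(runnable_jobs(done)):
            print("starting", job)  # the real scheduler submits a Hadoop job here
            done.add(job)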

Coordinating Distributed Systems Apache ZooKeeper

Apache ZooKeeper Because coordinating distributed systems is a Zoo. A centralized service for distributed synchronization. Really simple API, but can be used to build complex distributed services. Provides a simple way to get things like distributed locks right without having to worry about the details yourself.

Apache ZooKeeper Overview Filesystem-like data structure: znodes. Znodes can have data (up to 1 MiB) and children. Atomic changes. Versioning. Ephemeral nodes. Watches.
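These primitives are easy to see from Python. A minimal sketch using the kazoo client library; kazoo and the paths are illustrative choices here, not necessarily what OSP uses:

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="127.0.0.1:2181")  # hypothetical ensemble address
    zk.start()

    # Znodes form a filesystem-like tree and can carry data (up to 1 MiB)
    zk.ensure_path("/osp/config")
    zk.set("/osp/config", b'{"retention_days": 180}')

    # Reads return data plus a stat; the version enables atomic updates
    data, stat = zk.get("/osp/config")
    zk.set("/osp/config", data, version=stat.version)  # fails if changed meanwhile

    # An ephemeral node disappears automatically when this session ends
    zk.create("/osp/workers/worker-1", ephemeral=True, makepath=True)

    # A watch fires (once) when the znode changes
    def on_change(event):
        print("config changed:", event)

    zk.get("/osp/config", watch=on_change)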

Apache ZooKeeper Fault-tolerance and performance Needs at least 3 servers for redundancy: one server can go down and the service is still up. With 5 servers, 2 servers may go down and the service is still up. The coordinator is dynamically elected through a quorum vote. Write performance is limited by I/O performance on the coordinator node. Read performance is good: reads are local on each server, and all servers keep a full copy of the database. Clients connect to a randomly selected node; writes are forwarded to the coordinator node.

How OSP uses ZooKeeper Distributed queue: multiple producers, multiple consumers. Locks: either a znode is present, or it's not (atomicity). We want to run a cron job on multiple nodes, but only one instance at any time: an ephemeral node with a known name. Barriers: waiting until all tasks in a chain have completed before taking some action. No polling; uses watches. Configuration data: dynamic reconfiguration of a service by watching a known znode.
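The lock and queue recipes from this list could look like the following with kazoo; the paths and payloads are made up for illustration:

    from kazoo.client import KazooClient
    from kazoo.recipe.queue import LockingQueue

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()

    # Only one instance of the cron job runs at any time, cluster-wide
    lock = zk.Lock("/osp/locks/nightly-report", identifier="worker-1")
    with lock:
        print("running the nightly report")  # stand-in for the real job

    # Distributed queue: multiple producers, multiple consumers
    queue = LockingQueue(zk, "/osp/queue/jobs")
    queue.put(b"logextract:2013-05-01")
    item = queue.get()   # blocks until an item is available
    queue.consume()      # acknowledge after successful processing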

Storing Big Data Apache Cassandra

Apache Cassandra A distributed NoSQL solution. Distributed key/value store, but with columns. Scalable: keys are distributed among nodes; just add more hardware. Handles large amounts of data. Tunable consistency. Configurable replication. All nodes in the cluster are equal; no master node.

NoSQL I was raised with RDBMSs and MySQL; this is different. No relations. Key/value, but with columns: a key can have many, many columns. Think different: ask "How will data be accessed by the application?", not "How should I store data in the most efficient way?"

Cassandra Data Model Row keys with sparse columns. Row keys are hashed and data is distributed among servers. The keyspace replication factor (RF) decides the number of servers that hold each key. RF > X protects against X servers failing. RF also affects read performance. Columns are sorted, and can be retrieved in slices.
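In today's CQL terms, "sorted columns retrieved in slices" corresponds to clustering columns and range queries. A hedged sketch with the DataStax Python driver; the keyspace, table and key format are illustrative, not OSP's actual schema:

    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])  # hypothetical contact point
    session = cluster.connect()

    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS osp WITH replication =
            {'class': 'SimpleStrategy', 'replication_factor': 2}""")
    session.execute("""
        CREATE TABLE IF NOT EXISTS osp.usage (
            row_key text,       -- e.g. "country=sweden,version=7.5"
            day text,           -- clustering column: kept sorted on disk
            unique_users bigint,
            PRIMARY KEY (row_key, day)
        )""")

    # A "slice": one contiguous, sorted range under a single row key
    rows = session.execute(
        "SELECT day, unique_users FROM osp.usage "
        "WHERE row_key = %s AND day >= %s AND day <= %s",
        ("country=sweden,version=7.5", "2013-05-01", "2013-05-31"))
    for row in rows:
        print(row.day, row.unique_users)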

Cassandra Cluster (diagram): Keyspace1 with RF=2; column families X, Y and Z replicated across the nodes of the cluster.

Cassandra data model Sparse columns:

    Row key   | Home address  | Work e-mail             | Home e-mail
    forsberg  | The outskirts | forsberg@opera.com      |
    carrie    | NY Downtown   | sexandthecity@gmail.com | carrie@example.com
    ola       | Lambohov      | Ola.leifler@liu.se      |

Cassandra data model No JOIN; use a second column family:

    Row key                 | Username
    forsberg@opera.com      | forsberg
    sexandthecity@gmail.com | carrie
    Ola.leifler@liu.se      | ola

Cassandra storage Sorted String Tables. Writes to Cassandra go to a commit log, then to memory. Memory is dumped to SSTables periodically (or when the memtables get full). Very fast, scalable writes! Sparse columns are stored efficiently: no null values to store. Columns can have a TTL, which makes data go away automatically. SSTables are merged (compacted) automatically over time.

How OSP uses Cassandra Results of Hadoop calculations are stored in Cassandra, e.g. country=sweden,shoesize=43 : 7 unique users for Opera Mini. Millions of keys are stored every day. TTL is used to make keys go away after our retention period (e.g. daily data is gone after 6 months).
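The TTL trick, continuing the hedged sketch above; 180 days stands in for the six-month retention period:

    # Assuming the session and osp.usage table from the earlier sketch
    SIX_MONTHS = 180 * 24 * 3600  # retention period in seconds

    session.execute(
        "INSERT INTO osp.usage (row_key, day, unique_users) "
        "VALUES (%s, %s, %s) USING TTL " + str(SIX_MONTHS),
        ("country=sweden,shoesize=43", "2013-05-01", 7))
    # Cassandra removes the row automatically once the TTL expires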

Choosing the right tool for the job

Hadoop vs Cassandra Hadoop: really good streaming I/O, but slow as a melting iceberg on random access, so not useful as a backend for a UI. Has an M/R component. Cassandra: really good response times on random access to any row key, which makes it useful as a backend for a UI. No data crunching component. Hadoop can crunch data from Cassandra efficiently.

Cassandra vs ZooKeeper Two very different key/value (kind of) systems. Cassandra: scalable write performance; can keep large blobs; structured data, keys and columns (can have) data types; no atomicity, but tunable consistency. ZooKeeper: write performance limited by the I/O of one node; small amounts of data (1 MiB per znode) that must fit in the heap; values are just blobs (we use JSON); atomicity, ephemeral nodes, watches.

How do you choose the right component? KISS. Don't be afraid of Open Source, but evaluate how active the community is. Evaluate multiple alternatives. Choose what seems easiest to work with: if it takes 3 days to set it up for an experiment, it probably isn't right.

The UI

The UI I'm not a UI person, but we have other team members who are. Written in PHP. Uses Highcharts, a fine Norwegian component for drawing SVG graphs. Also uses Memcache for some local caching.

Tying it together

Tying it all together You need some kind of glue to build a system. We chose Python. Chances are that if you need to communicate with component X, Python already has a module for it. Really quick to do development in: get down to business at once! At Opera we use a multitude of languages.

End Questions?

We are hiring! http://jobs.opera.com

2013 Opera Software ASA. All rights reserved.