Big Data Analytics with Cassandra, Spark & MLLib Matthias Niehoff
AGENDA: Spark Basics, In A Cluster, Cassandra Spark Connector, Use Cases, Spark Streaming, Spark SQL, Spark MLLib, Live Demo
CQL QUERYING LANGUAGE WITH LIMITATIONS
Table performer: name text (partition key), style text, country text, type text
SELECT * FROM performer WHERE name = 'ACDC' → ok
SELECT * FROM performer WHERE name = 'ACDC' AND country = 'Australia' → not ok
SELECT country, COUNT(*) AS quantity FROM artists GROUP BY country ORDER BY quantity DESC → not possible
WHAT IS APACHE SPARK? Open Source & Apache project since 2010. A data processing framework for batch processing and stream processing.
WHY USE SPARK? Fast: up to 100 times faster than Hadoop, a lot of in-memory processing, scales out across many nodes. Easy: Scala, Java and Python APIs, clean code (e.g. with lambdas in Java 8), rich API: map, reduce, filter, groupBy, sort, union, join, reduceByKey, groupByKey, sample, take, first, count, ... Fault-tolerant: easily reproducible.
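As a minimal sketch of that API style (the collection and numbers are illustrative, not from the slides):
// distribute a local collection as an RDD
val numbers = sc.parallelize(1 to 10)
// keep the even numbers, double them and sum the result
val sum = numbers.filter(_ % 2 == 0).map(_ * 2).reduce(_ + _)
// sum: Int = 60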
EASILY REPRODUCIBLE? RDDs (Resilient Distributed Datasets): read-only collections of objects, distributed across the cluster (in memory or on disk), defined through transformations, automatically rebuilt on failure. Operations: transformations (map, filter, reduce, ...) → new RDD; actions (count, collect, save). Only actions start processing!
RDD EXAMPLE
scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4af09
scala> linesWithSpark.count()
res0: Long = 126
REPRODUCE AN RDD USING A LINEAGE TREE (diagram: a data source flows through transformations such as filter, map, union, sample and cache into intermediate RDDs; count() actions produce the final values)
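A minimal sketch of how cache() fits into such a lineage (the file name and filter condition are illustrative):
val lines  = sc.textFile("app.log")             // data source
val errors = lines.filter(_.contains("ERROR"))  // transformation, lazy
errors.cache()                                  // keep this RDD in memory once computed
val total    = errors.count()                   // 1st action: reads, filters, caches
val distinct = errors.distinct().count()        // 2nd action: reuses the cached RDD, no re-read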
SPARK ON CASSANDRA Spark Cassandra Connector by DataStax: https://github.com/datastax/spark-cassandra-connector Cassandra tables as Spark RDDs (read & write). Mapping of C* tables and rows onto Java/Scala objects. Server-side filtering ("where"). Compatible with Spark 0.9 and Cassandra 2.0. Clone & compile with SBT, or download at Maven Central.
USE SPARK CONNECTOR Start the Spark shell:
bin/spark-shell --jars ~/path/to/jar/spark-cassandra-connector-assembly-1.3.0-SNAPSHOT.jar --conf spark.cassandra.connection.host=localhost --driver-memory 2g --executor-memory 3g
Import the Cassandra classes:
scala> import com.datastax.spark.connector._
READ TABLE Read the complete table:
val movies = sc.cassandraTable("movie", "movies")
// Return type: CassandraRDD[CassandraRow]
Read selected columns:
val movies = sc.cassandraTable("movie", "movies").select("title", "year")
Filter rows:
val movies = sc.cassandraTable("movie", "movies").where("title = 'Die Hard'")
READ TABLE Access columns in the result set: getString, getStringOption, getOrElse, getMap, ...
movies.collect.foreach(r => println(r.get[String]("title")))
As Scala tuple:
sc.cassandraTable[(String, Int)]("movie", "movies").select("title", "year")
or
sc.cassandraTable("movie", "movies").select("title", "year").as((_: String, _: Int))
→ CassandraRDD[(String, Int)]
READ TABLE CASE CLASS
case class Movie(title: String, year: Int)
sc.cassandraTable[Movie]("movie", "movies").select("title", "year")
sc.cassandraTable("movie", "movies").select("title", "year").as(Movie)
→ CassandraRDD[Movie]
WRITE TABLE Every RDD can be saved (not only CassandraRDDs).
Tuple:
val tuples = sc.parallelize(Seq(("Hobbit", 2012), ("96 Hours", 2008)))
tuples.saveToCassandra("movie", "movies", SomeColumns("title", "year"))
Case class:
case class Movie(title: String, year: Int)
val objects = sc.parallelize(Seq(Movie("Hobbit", 2012), Movie("96 Hours", 2008)))
objects.saveToCassandra("movie", "movies")
PAIR RDDs Pairs of ([atomic, collection, object], [atomic, collection, object]): a key (not necessarily unique) and a value.
val fluege = List(("Thomas", "Berlin"), ("Mark", "Paris"), ("Thomas", "Madrid"))
val pairRDD = sc.parallelize(fluege)
pairRDD.filter(_._1 == "Thomas").collect.foreach(t => println(t._1 + " flew to " + t._2))
WHY USE PAIR RDDs? Parallelization! Keys are used for partitioning; pairs with different keys are distributed across the cluster. Efficient processing of: aggregate by key, group by key, sort by key, joins and union based on keys (see the sketch below).
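For example, counting flights per person on the flight pairs from the previous example is a single reduceByKey (a sketch; output order may vary):
// ("Thomas", "Berlin") becomes ("Thomas", 1), then the values are summed per key
val flightsPerPerson = pairRDD.mapValues(_ => 1).reduceByKey(_ + _)
flightsPerPerson.collect.foreach(println)
// e.g. (Thomas,2)
//      (Mark,1)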
SPECIAL OPS FOR PAIR RDDs keys, values; mapValues(func), flatMapValues(func); lookup(key), collectAsMap(), countByKey(); reduceByKey(func), foldByKey(zeroValue)(func); groupByKey(), cogroup(otherDataset); sortByKey([ascending]); join(otherDataset), leftOuterJoin(otherDataset), rightOuterJoin(otherDataset); union(otherDataset), subtractByKey(otherDataset)
CASSANDRA EXAMPLE Table director: name text (partition key), country text
val pairRDD = sc.cassandraTable("movie", "director").map(r => (r.getString("country"), r))
// Directors per country, sorted
pairRDD.mapValues(v => 1).reduceByKey(_ + _).sortBy(_._2, ascending = false).collect.foreach(println)
// or, unsorted
pairRDD.countByKey().foreach(println)
// All countries
pairRDD.keys.distinct
CASSANDRA EXAMPLE Table director: name text (partition key), country text. Table movie: title text (partition key), director text.
pairRDD.groupByKey() → RDD[(String, Iterable[CassandraRow])] // directors grouped by country
val directors = sc.cassandraTable(..).map(r => (r.getString("name"), r))
val movies = sc.cassandraTable(..).map(r => (r.getString("director"), r))
directors.cogroup(movies) → RDD[(String, (Iterable[CassandraRow], Iterable[CassandraRow]))] // (directors, movies) per key
CASSANDRA EXAMPLE - JOINS Joins can be expensive: partitions for the same keys in different tables may live on different nodes, which requires shuffling.
Table movie: title text (partition key), director text. Table director: name text (partition key), country text.
val directors = sc.cassandraTable(..).map(r => (r.getString("name"), r))
val movies = sc.cassandraTable(..).map(r => (r.getString("director"), r))
movies.join(directors) → RDD[(String, (CassandraRow, CassandraRow))] // (movie, director)
USE CASES Data Loading, Validation & Normalization, Analyses (Joins, Transformations, ...), Schema Migration, Data Conversion
DATA LOADING In particular for huge amounts of external data. Support for CSV, TSV, XML, JSON and others. Example:
case class User(id: java.util.UUID, name: String)
val users = sc.textFile("users.csv")
  .repartition(2 * sc.defaultParallelism)
  .map(line => line.split(",") match {
    case Array(id, name) => User(java.util.UUID.fromString(id), name)
  })
users.saveToCassandra("keyspace", "users")
VALIDATION Validate consistency in a Cassandra database. Syntactic: uniqueness (only relevant for columns not in the PK), referential integrity, integrity of duplicates. Semantic: business or application constraints, e.g. at least one genre per movie, a maximum of 10 tags per blog post. (A referential-integrity sketch follows below.)
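A hedged sketch of one such referential-integrity check with the connector, reusing the keyspace and table names from the earlier slides: find movies whose director does not exist.
val directors = sc.cassandraTable("movie", "director").map(r => (r.getString("name"), r))
val movies    = sc.cassandraTable("movie", "movies").map(r => (r.getString("director"), r))
// leftOuterJoin keeps every movie; a missing right side marks a dangling reference
val orphans = movies.leftOuterJoin(directors).filter { case (_, (_, director)) => director.isEmpty }
orphans.keys.distinct.collect.foreach(println)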
ANALYSES Modelling, Mining, Transforming, ... Use cases: Recommendation, Fraud Detection, Link Analysis (Social Networks, Web), Advertising, Data Stream Analytics (→ Spark Streaming), Machine Learning (→ Spark ML)
SCHEMA MIGRATION & DATA CONVERSION Changes on existing tables: a new table is required when changing the primary key, otherwise changes can be performed in place. Creating new tables: data derived from existing tables to support new queries (see the sketch below). Use the CassandraConnector in Spark:
val cc = CassandraConnector(sc.getConf)
cc.withSessionDo { session =>
  session.execute("ALTER TABLE movie.movies ADD year timestamp")
}
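Deriving such a new table can itself be a small Spark job. A sketch, assuming a target table movies_by_year already exists (that table name is illustrative, not from the slides):
// read title and year from the existing table and write them into a table laid out for the new query
sc.cassandraTable[(String, Int)]("movie", "movies")
  .select("title", "year")
  .saveToCassandra("movie", "movies_by_year", SomeColumns("title", "year"))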
STREAM PROCESSING ... with Spark Streaming: real-time processing. Supported sources: TCP, HDFS, S3, Kafka, Twitter, ... Data as a Discretized Stream (DStream). All operations of the generic RDD, plus stateful operations and sliding windows.
SPARK STREAMING - EXAMPLE
val ssc = new StreamingContext(sc, Seconds(1))
val stream = ssc.socketTextStream("127.0.0.1", 9999)
stream.map(x => 1).reduce(_ + _).print()
ssc.start()
// await manual termination or error
ssc.awaitTermination()
// manual termination
ssc.stop()
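The sliding windows mentioned above could be added to this example roughly as follows (a sketch; a 30-second window recomputed every 10 seconds):
// count the events of the last 30 seconds, updated every 10 seconds
stream.map(x => 1)
      .reduceByWindow(_ + _, Seconds(30), Seconds(10))
      .print()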
SPARK SQL SQL queries with Spark (SQL & HiveQL), on structured data, on SchemaRDDs. Every result of Spark SQL is a SchemaRDD. All operations of the generic RDDs are available. Supported, also on non-PK columns: Joins, Union, Group By, Having, Order By
SPARK SQL CASSANDRA EXAMPLE Table performer: name text (partition key), style text, country text, type text
val csc = new CassandraSQLContext(sc)
csc.setKeyspace("musicdb")
val result = csc.sql("SELECT country, COUNT(*) AS anzahl " +
  "FROM artists GROUP BY country " +
  "ORDER BY anzahl DESC")
result.collect.foreach(println)
SPARK SQL RDD / JSON EXAMPLE
{"name":"Michael"}
{"name":"Jan", "alter":30}
{"name":"Tim", "alter":17}
val sqlContext = new SQLContext(sc)
val personen = sqlContext.jsonFile(path)
// Show the schema
personen.printSchema()
personen.registerTempTable("personen")
val erwachsene = sqlContext.sql("SELECT name FROM personen WHERE alter > 18")
erwachsene.collect.foreach(println)
SPARK MLLIB Fully integrated in Spark: scalable, Scala, Java & Python APIs, usable with Spark Streaming & Spark SQL. Packages various algorithms for machine learning, including Clustering, Classification, Prediction, Collaborative Filtering. Still under development (performance, algorithms).
EXAMPLE CLUSTERING (figure: a set of data points plotted by age and income is grouped into meaningful clusters)
EXAMPLE CLUSTERING (K-MEANS)
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
// Load and parse the data
val data = sc.textFile("data/mllib/kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
// Cluster the data into 2 classes using KMeans with 20 iterations
val clusters = KMeans.train(parsedData, 2, 20)
// Evaluate the clustering by computing the Within Set Sum of Squared Errors
val WSSSE = clusters.computeCost(parsedData)
println("Within Set Sum of Squared Errors = " + WSSSE)
EXAMPLE CLASSIFICATION (figure)
EXAMPLE CLASSIFICATION (LINEAR SVM)
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.util.MLUtils
// Load training data in LIBSVM format.
val data = MLUtils.loadLibSVMFile(sc, "sample_libsvm_data.txt")
// Split data into training (60%) and test (40%).
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)
// Run the training algorithm to build the model
val numIterations = 100
val model = SVMWithSGD.train(training, numIterations)
EXAMPLE CLASSIFICATION (LINEAR SVM)
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
// Compute raw scores on the test set.
val scoreAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}
// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()
println("Area under ROC = " + auROC)
COLLABORATIVE FILTERING
COLLABORATIVE FILTERING (ALS)
import org.apache.spark.mllib.recommendation.{ALS, Rating}
// Load and parse the data (userid,itemid,rating)
val data = sc.textFile("data/mllib/als/test.data")
val ratings = data.map(_.split(',') match {
  case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
})
// Build the recommendation model using ALS
val rank = 10
val numIterations = 20
val model = ALS.train(ratings, rank, numIterations, 0.01)
COLLABORATIVE FILTERING (ALS)
// Evaluate the model on rating data
val usersProducts = ratings.map { case Rating(user, product, rate) => (user, product) }
val predictions = model.predict(usersProducts).map {
  case Rating(user, product, rate) => ((user, product), rate)
}
val ratesAndPreds = ratings.map {
  case Rating(user, product, rate) => ((user, product), rate)
}.join(predictions)
val MSE = ratesAndPreds.map { case ((user, product), (r1, r2)) =>
  val err = r1 - r2
  err * err
}.mean()
println("Mean Squared Error = " + MSE)
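Beyond the error metric, the trained model can also be asked for a single predicted rating directly (a sketch; the user and product ids are illustrative):
// predicted rating of product 42 for user 1
val predictedRating = model.predict(1, 42)
println("Predicted rating for user 1, product 42: " + predictedRating)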
CURIOUS? GETTING STARTED! Setup, code & examples on GitHub: https://github.com/mniehoff/spark-cassandra-playground
Install Cassandra. For local experiments: Cassandra Cluster Manager, https://github.com/pcmanus/ccm
ccm create test -v binary:2.0.5 -n 3 -s
ccm status
ccm node1 status
Install Spark: tarball (pre-built for Hadoop 2.6), Homebrew, MacPorts
CURIOUS? GETTING STARTED! Spark Cassandra Connector: https://github.com/datastax/spark-cassandra-connector
Clone and build with ./sbt/sbt assembly, or download the jar from Maven Central.
Start the Spark shell:
spark-shell --jars ~/path/to/jar/spark-cassandra-connector-assembly-1.3.0-SNAPSHOT.jar --conf spark.cassandra.connection.host=localhost
Import the Cassandra classes:
scala> import com.datastax.spark.connector._
DATASTAX ENTERPRISE EDITION Production-ready Cassandra. Integrates Spark, Hadoop, Solr, OpsCenter & DevCenter. Bundled and ready to go. Free for development and testing. http://www.datastax.com/download (one-time registration required)
QUESTIONS?
CONTACT Matthias Niehoff, Zeppelinstraße 2, 76185 Karlsruhe, tel +49 (0) 721 95 95 681, matthias.niehoff@codecentric.de