Big Data Analytics with Cassandra, Spark & MLLib
1 Big Data Analytics with Cassandra, Spark & MLLib Matthias Niehoff
2 AGENDA Spark Basics In A Cluster Cassandra Spark Connector Use Cases Spark Streaming Spark SQL Spark MLLib Live Demo
3 CQL QUERYING LANGUAGE WITH LIMITATIONS
Table performer: name text (K), style text, country text, type text
SELECT * FROM performer WHERE name = 'ACDC' → ok
SELECT * FROM performer WHERE name = 'ACDC' AND country = 'Australia' → not ok
SELECT country, COUNT(*) as quantity FROM artists GROUP BY country ORDER BY quantity DESC → not possible
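The GROUP BY above is exactly the kind of query Spark makes possible on Cassandra data; a minimal sketch using the connector introduced later (the keyspace name "musicdb" is an assumption):
import com.datastax.spark.connector._
val byCountry = sc.cassandraTable("musicdb", "performer")
  .map(row => (row.getString("country"), 1))
  .reduceByKey(_ + _) // COUNT(*) ... GROUP BY country
  .sortBy(- _._2)     // ORDER BY quantity DESC
byCountry.collect.foreach(println)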
4 WHAT IS APACHE SPARK? Open source, an Apache project since 2010. A data processing framework for batch processing and stream processing.
5 WHY USE SPARK?
Fast: up to 100 times faster than Hadoop, a lot of in-memory processing, scales out across many nodes.
Easy: Scala, Java and Python APIs, clean code (e.g. with lambdas in Java 8), a rich API: map, reduce, filter, groupBy, sort, union, join, reduceByKey, groupByKey, sample, take, first, count, ...
Fault-tolerant: easily reproducible.
6 EASILY REPRODUCIBLE? RDDs
Resilient Distributed Dataset: a read-only collection of objects, distributed across the cluster (in memory or on disk), determined through transformations, automatically rebuilt on failure.
Operations: Transformations (map, filter, reduce, ...) → new RDD; Actions (count, collect, save).
Only actions start processing!
7 RDD EXAMPLE
scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4af09
scala> linesWithSpark.count()
res0: Long =
8 REPRODUCE AN RDD USING A TREE
[Figure: lineage tree. A data source is filtered into rdd1; further transformations (map, union, cache, sample) derive rdd2 through rdd6; count() actions produce val1, val2 and val3.]
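A minimal sketch approximating that tree (file name and transformations are illustrative):
val source = sc.textFile("data.txt")  // data source
val rdd1 = source.filter(_.nonEmpty)
val rdd2 = rdd1.map(_.toLowerCase)
val val1 = rdd2.count()               // action: triggers computation
val rdd3 = rdd1.union(rdd2).cache()   // cached for reuse
val val2 = rdd3.count()               // computed once, then served from cache
val rdd4 = rdd3.sample(false, 0.1)
val val3 = rdd4.count()               // on node failure, rebuilt from the lineage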
9 SPARK ON CASSANDRA
Spark Cassandra Connector by DataStax. Cassandra tables as Spark RDDs (read & write). Mapping of C* tables and rows onto Java/Scala objects. Server-side filtering ("where"). Compatible with Spark 0.9 and Cassandra 2.0. Clone & compile with SBT or download from Maven Central.
10 USE THE SPARK CONNECTOR
Start the Spark shell:
bin/spark-shell --jars ~/path/to/jar/spark-cassandra-connector-assembly-SNAPSHOT.jar --conf spark.cassandra.connection.host=localhost --driver-memory 2g --executor-memory 3g
Import the Cassandra classes:
scala> import com.datastax.spark.connector._
11 READ A TABLE
Read a complete table:
val movies = sc.cassandraTable("movie", "movies")
// Return type: CassandraRDD[CassandraRow]
Read selected columns:
val movies = sc.cassandraTable("movie", "movies").select("title", "year")
Filter rows:
val movies = sc.cassandraTable("movie", "movies").where("title = 'Die Hard'")
12 READ A TABLE
Access columns in the result set: getString, getStringOption, getOrElse, getMap, ...
movies.collect.foreach(r => println(r.get[String]("title")))
As a Scala tuple:
sc.cassandraTable[(String, Int)]("movie", "movies").select("title", "year")
or
sc.cassandraTable("movie", "movies").select("title", "year").as((_: String, _: Int))
→ CassandraRDD[(String, Int)]
13 READ A TABLE WITH A CASE CLASS
case class Movie(title: String, year: Int)
sc.cassandraTable[Movie]("movie", "movies").select("title", "year")
sc.cassandraTable("movie", "movies").select("title", "year").as(Movie)
→ CassandraRDD[Movie]
14 WRITE A TABLE
Every RDD can be saved (not only CassandraRDDs).
Tuples:
val tuples = sc.parallelize(Seq(("Hobbit", 2012), ("96 Hours", 2008)))
tuples.saveToCassandra("movie", "movies", SomeColumns("title", "year"))
Case class:
case class Movie(title: String, year: Int)
val objects = sc.parallelize(Seq(Movie("Hobbit", 2012), Movie("96 Hours", 2008)))
objects.saveToCassandra("movie", "movies")
15 PAIR RDDS
([atomic, collection, object], [atomic, collection, object]): a key (not unique) and a value.
val fluege = List(("Thomas", "Berlin"), ("Mark", "Paris"), ("Thomas", "Madrid"))
val pairRDD = sc.parallelize(fluege)
pairRDD.filter(_._1 == "Thomas").collect.foreach(t => println(t._1 + " flew to " + t._2))
16 WHY USE PAIR RDDS?
Parallelization! Keys are used for partitioning; pairs with different keys are distributed across the cluster.
Efficient processing of: aggregate by key, group by key, sort by key, joins/union based on keys.
17 SPECIAL OPS FOR PAIR RDDS
keys, values
mapValues(func), flatMapValues(func)
lookup(key), collectAsMap(), countByKey()
reduceByKey(func), foldByKey(zeroValue)(func)
groupByKey(), cogroup(otherDataset)
sortByKey([ascending])
join(otherDataset), leftOuterJoin(otherDataset), rightOuterJoin(otherDataset)
union(otherDataset), subtractByKey(otherDataset)
18 CASSANDRA EXAMPLE
Table director: name text (K), country text
val pairRDD = sc.cassandraTable("movie", "director").map(r => (r.getString("country"), r))
// Directors per country, sorted
pairRDD.mapValues(v => 1).reduceByKey(_ + _).sortBy(- _._2).collect.foreach(println)
// or, unsorted
pairRDD.countByKey().foreach(println)
// All countries
pairRDD.keys
19 CASSANDRA EXAMPLE
Tables: director (name text (K), country text), movie (title text (K), director text)
// Directors grouped by country
pairRDD.groupByKey() → RDD[(String, Iterable[CassandraRow])]
val directors = sc.cassandraTable(..).map(r => (r.getString("name"), r))
val movies = sc.cassandraTable(..).map(r => (r.getString("director"), r))
directors.cogroup(movies)
→ RDD[(String, (Iterable[CassandraRow], Iterable[CassandraRow]))] // (director rows, movie rows)
20 CASSANDRA EXAMPLE: JOINS
Joins can be expensive: partitions for the same keys in different tables may live on different nodes, which requires shuffling.
Tables: movie (title text (K), director text), director (name text (K), country text)
val directors = sc.cassandraTable(..).map(r => (r.getString("name"), r))
val movies = sc.cassandraTable(..).map(r => (r.getString("director"), r))
movies.join(directors)
→ RDD[(String, (CassandraRow, CassandraRow))] // (movie, director)
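A common mitigation, sketched here for plain Spark pair RDDs (the partitioner choice is an assumption, reusing the names above), is to co-partition both sides once so the join itself needs no full shuffle:
import org.apache.spark.HashPartitioner
val part = new HashPartitioner(sc.defaultParallelism)
val moviesByDirector = movies.partitionBy(part).cache()
val directorsByName = directors.partitionBy(part).cache()
moviesByDirector.join(directorsByName) // co-partitioned: no additional shuffle for the join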
21 USE CASES
Data Loading
Validation & Normalization
Analyses (Joins, Transformations, ...)
Schema Migration
Data Conversion
22 DATA LOADING
In particular for huge amounts of external data. Support for CSV, TSV, XML, JSON and other formats. Example:
case class User(id: java.util.UUID, name: String)
val users = sc.textFile("users.csv")
  .repartition(2 * sc.defaultParallelism)
  .map(line => line.split(",") match {
    case Array(id, name) => User(java.util.UUID.fromString(id), name)
  })
users.saveToCassandra("keyspace", "users")
23 VALIDATION
Validate consistency in a Cassandra database.
Syntactic: uniqueness (only relevant for columns not in the PK), referential integrity, integrity of duplicates.
Semantic: business or application constraints, e.g. at least one genre per movie, a maximum of 10 tags per blog post.
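A uniqueness check on a non-PK column reduces to a pair-RDD aggregation; a minimal sketch (keyspace, table and column names are assumptions):
val duplicates = sc.cassandraTable("keyspace", "users")
  .map(row => (row.getString("email"), 1))
  .reduceByKey(_ + _)
  .filter { case (_, count) => count > 1 } // violations of uniqueness
duplicates.collect.foreach { case (email, count) => println(email + " occurs " + count + " times") }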
24 ANALYSES
Modelling, Mining, Transforming, ...
Use cases: Recommendation, Fraud Detection, Link Analysis (Social Networks, Web), Advertising, Data Stream Analytics (→ Spark Streaming), Machine Learning (→ Spark MLlib)
25 SCHEMA MIGRATION & DATA CONVERSION
Changes on existing tables: a new table is required when changing the primary key; other changes can be performed in place.
Creating new tables: data derived from existing tables, to support new queries.
Use the CassandraConnector in Spark:
val cc = CassandraConnector(sc.getConf)
cc.withSessionDo { session =>
  session.execute("ALTER TABLE movie.movies ADD year timestamp")
}
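Deriving a new table from an existing one then becomes a read-transform-write job; a minimal sketch (the target table movies_by_year is hypothetical):
case class MovieByYear(year: Int, title: String)
sc.cassandraTable("movie", "movies")
  .map(r => MovieByYear(r.getInt("year"), r.getString("title")))
  .saveToCassandra("movie", "movies_by_year")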
26 STREAM PROCESSING
... with Spark Streaming. Real-time processing. Supported sources: TCP, HDFS, S3, Kafka, Twitter, ...
Data as Discretized Streams (DStreams). All operations of the generic RDD. Stateful operations. Sliding windows.
27 SPARK STREAMING EXAMPLE
val ssc = new StreamingContext(sc, Seconds(1))
val stream = ssc.socketTextStream(" ", 9999)
stream.map(x => 1).reduce(_ + _).print()
ssc.start()
// await manual termination or error
ssc.awaitTermination()
// manual termination
ssc.stop()
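The sliding windows from the previous slide combine naturally with pair DStreams; a minimal word-count sketch continuing the stream above (window and slide durations are illustrative):
val counts = stream.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b,
    Seconds(30),  // window length
    Seconds(10))  // slide interval
counts.print()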
28 SPARK SQL
SQL queries with Spark (SQL & HiveQL), on structured data, on SchemaRDDs. Every Spark SQL result is a SchemaRDD. All operations of the generic RDDs are available.
Supported, also on non-PK columns: Joins, Union, Group By, Having, Order By
29 SPARK SQL CASSANDRA EXAMPLE
Table performer: name text (K), style text, country text, type text
val csc = new CassandraSQLContext(sc)
csc.setKeyspace("musicdb")
val result = csc.sql("SELECT country, COUNT(*) as anzahl " +
  "FROM artists GROUP BY country " +
  "ORDER BY anzahl DESC")
result.collect.foreach(println)
30 SPARK SQL RDD / JSON EXAMPLE
{"name":"Michael"}
{"name":"Jan", "alter":30}
{"name":"Tim", "alter":17}
val sqlContext = new SQLContext(sc)
val personen = sqlContext.jsonFile(path)
// Show the schema
personen.printSchema()
personen.registerTempTable("personen")
val erwachsene = sqlContext.sql("SELECT name FROM personen WHERE alter > 18")
erwachsene.collect.foreach(println)
31 SPARK MLLIB
Fully integrated in Spark. Scalable. Scala, Java & Python APIs. Usable with Spark Streaming & Spark SQL. Packages various algorithms for machine learning, including: Clustering, Classification, Prediction, Collaborative Filtering. Still under development (performance, algorithms).
32 EXAMPLE CLUSTERING
[Figure: a set of data points plotted by income vs. age, grouped into meaningful clusters.]
33 EXAMPLE CLUSTERING (K-MEANS)
// Load and parse data
val data = sc.textFile("data/mllib/kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
// Cluster the data into 2 classes using KMeans with 20 iterations
val clusters = KMeans.train(parsedData, 2, 20)
// Evaluate clustering by computing Within Set Sum of Squared Errors
val WSSSE = clusters.computeCost(parsedData)
println("Within Set Sum of Squared Errors = " + WSSSE)
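Once trained, the model assigns new points to the learned clusters; a short usage sketch (the feature values are hypothetical):
import org.apache.spark.mllib.linalg.Vectors
val cluster = clusters.predict(Vectors.dense(0.1, 0.2, 0.3))
println("point belongs to cluster " + cluster)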
34 EXAMPLE CLASSIFICATION [figure-only slide]
35 EXAMPLE CLASSIFICATION [figure-only slide]
36 EXAMPLE CLASSIFICATION (LINEAR SVM)
// Load training data in LIBSVM format.
val data = MLUtils.loadLibSVMFile(sc, "sample_libsvm_data.txt")
// Split data into training (60%) and test (40%).
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)
// Run training algorithm to build the model
val numIterations = 100
val model = SVMWithSGD.train(training, numIterations)
37 EXAMPLE CLASSIFICATION (LINEAR SVM)
// Clear the default threshold so that predict returns raw scores.
model.clearThreshold()
// Compute raw scores on the test set.
val scoreAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}
// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()
println("Area under ROC = " + auROC)
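To get hard 0/1 labels again instead of raw scores, the threshold can be restored; a short usage sketch (the threshold value is an assumption):
model.setThreshold(0.0)
val testPoint = test.first()
val label = model.predict(testPoint.features) // now returns 0.0 or 1.0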
38 COLLABORATIVE FILTERING
39 COLLABORATIVE FILTERING (ALS)
// Load and parse the data (userId, itemId, rating)
val data = sc.textFile("data/mllib/als/test.data")
val ratings = data.map(_.split(',') match {
  case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
})
// Build the recommendation model using ALS
val rank = 10
val numIterations = 20
val model = ALS.train(ratings, rank, numIterations, 0.01)
41 COLLABORATIVE FILTERING (ALS)
// Evaluate the model on rating data
val usersProducts = ratings.map { case Rating(user, product, rate) => (user, product) }
val predictions = model.predict(usersProducts).map {
  case Rating(user, product, rate) => ((user, product), rate)
}
val ratesAndPreds = ratings.map {
  case Rating(user, product, rate) => ((user, product), rate)
}.join(predictions)
val MSE = ratesAndPreds.map { case ((user, product), (r1, r2)) =>
  val err = r1 - r2; err * err
}.mean()
println("Mean Squared Error = " + MSE)
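To actually serve recommendations, recommendProducts is available on the trained MatrixFactorizationModel in Spark 1.2+; a short usage sketch (user id and count are hypothetical):
val top5 = model.recommendProducts(1, 5) // top 5 items for user 1
top5.foreach(println)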
42 CURIOUS? GETTING STARTED!
Setup, code & examples on GitHub.
Install Cassandra. For local experiments: Cassandra Cluster Manager (ccm):
ccm create test -v binary: -n 3 -s
ccm status
ccm node1 status
Install Spark: tarball (pre-built for Hadoop 2.6), Homebrew, or MacPorts.
43 CURIOUS? GETTING STARTED!
Spark Cassandra Connector: clone and run ./sbt/sbt assembly, or download from Maven Central.
Start the Spark shell:
spark-shell --jars ~/path/to/jar/spark-cassandra-connector-assembly-SNAPSHOT.jar --conf spark.cassandra.connection.host=localhost
Import the Cassandra classes:
scala> import com.datastax.spark.connector._
44 DATASTAX ENTERPRISE EDITION
Production-ready Cassandra. Integrates Spark, Hadoop, Solr, OpsCenter & DevCenter. Bundled and ready to go. Free for development and test (one-time registration required).
45 QUESTIONS?
46 CONTACT Matthias Niehoff, Zeppelinstraße, Karlsruhe, tel +49 (0), [email protected]