StratioDeep. An integration layer between Cassandra and Spark. Álvaro Agea Herradón Antonio Alcocer Falcón

SORRY

SDS

OUR CLIENTS

STRATIO DEEP
An efficient data mining solution
"Two and two are four? Sometimes... Sometimes they are five." G. Orwell

GOALS
1. Why do you need Cassandra?
2. What is the problem?
3. Why do you need Spark?
4. How do they work together?

CASSANDRA
1. Based on Amazon's Dynamo
2. Replication, key/value, P2P
3. Based on Google's Bigtable
4. Column oriented

THE RING
Each Cassandra server is assigned a unique token that determines which keys it is the first replica for.
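
The token rule above can be sketched in a few lines. This is a minimal illustration, assuming a tiny integer token space and a hypothetical `Ring` class; it is not Cassandra's partitioner or replication strategy. A node with token T is treated as the first replica for every key token in the range (previous token, T], and further replicas are taken by walking the ring clockwise.

```python
import bisect


class Ring:
    """Toy token ring: each node owns the key tokens up to and including its own token."""

    def __init__(self, node_tokens):
        # node_tokens maps a (hypothetical) node name to its unique token.
        self._entries = sorted((token, node) for node, token in node_tokens.items())
        self._tokens = [token for token, _ in self._entries]

    def replicas_for_token(self, key_token, replication_factor=1):
        # First node whose token is >= key_token; wrap to the first node past the end.
        start = bisect.bisect_left(self._tokens, key_token) % len(self._entries)
        count = min(replication_factor, len(self._entries))
        # Additional replicas are simply the next nodes clockwise (an assumption of this sketch).
        return [self._entries[(start + i) % len(self._entries)][1] for i in range(count)]


ring = Ring({"node-a": 100, "node-b": 200, "node-c": 300})
first_replica = ring.replicas_for_token(150)[0]
```

In a real cluster the key token would come from hashing the partition key; here it is passed in directly so the ownership rule stays visible.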

CONSISTENCY
Cassandra allows clients to specify the desired consistency level on reads and writes: ONE, QUORUM, ALL...
Cassandra repairs missing data in three ways:
1. Hinted handoff
2. Read repair
3. Anti-entropy node repair
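
As a rough illustration of what the consistency levels mean in practice, the sketch below computes how many replicas must acknowledge an operation at each level, and checks the usual overlap rule: a read is guaranteed to see the latest write when read acks plus write acks exceed the replication factor. The function names are hypothetical and only the three levels named on the slide are handled.

```python
def required_acks(level, replication_factor):
    """Number of replicas that must respond for the given consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {level}")


def read_sees_latest_write(read_level, write_level, replication_factor):
    """True when every read replica set must overlap every write replica set."""
    reads = required_acks(read_level, replication_factor)
    writes = required_acks(write_level, replication_factor)
    return reads + writes > replication_factor
```

For example, with a replication factor of 3, QUORUM reads combined with QUORUM writes overlap (2 + 2 > 3), while ONE combined with ONE does not.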

2i: SECONDARY INDEX
Cassandra allows clients to create secondary indexes, so you can search on other columns, not only on the primary key.
Stratio has developed a new secondary index technology based on Apache Lucene:
- Full-text search
- Full support for logical operators (AND, OR)

YET ANOTHER DATABASE?

WHY?

BIG DATA USE CASE A: FEW USERS WITH HIGH VOLUME OF DATA

BIG DATA USE CASE B: MANY USERS WITH LOW VOLUME OF DATA

BIG DATA USE CASE C: MANY USERS WITH HIGH VOLUME OF DATA

BUT

WHAT IS THE PROBLEM?
1. In Cassandra, you need to design the schema with the query in mind.
2. Every other type of query is either very inefficient or impossible to resolve.
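
The problem can be illustrated with plain dictionaries standing in for tables. This is only an analogy, with made-up rows and names: a table keyed by `user_id` answers "find this user" with a single lookup, but "find users by country" has to touch every row unless a second, query-specific table is maintained.

```python
from collections import defaultdict

# Hypothetical sample rows; the key of the first "table" is user_id.
rows = [
    {"user_id": "u1", "country": "ES", "visits": 3},
    {"user_id": "u2", "country": "FR", "visits": 7},
    {"user_id": "u3", "country": "ES", "visits": 1},
]
users_by_id = {row["user_id"]: row for row in rows}


def find_by_id(user_id):
    # The query the schema was designed for: one direct lookup.
    return users_by_id.get(user_id)


def find_by_country_scan(country):
    # A query the schema was not designed for: every row must be read.
    return [row for row in users_by_id.values() if row["country"] == country]


# The usual workaround: a second table designed for the second query.
users_by_country = defaultdict(list)
for row in rows:
    users_by_country[row["country"]].append(row)
```

Each new query pattern therefore means either a full scan or another denormalized table, which is the cost the following slides try to avoid.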

LET'S GET DIRTY

WHAT OPTIONS DO WE HAVE?
1. Run Hive queries on top of C*
2. Write an ETL script and load data into another DB
3. Clone the cluster

HIVE + C*
1. We can use CassandraStorageHandler.
2. Create an external table in Hive to use the data in Cassandra.
3. Queries are resolved in two phases: load the data into Hadoop, then run the Hive batch process.
4. This solution DOES NOT WORK.

LOAD DATA INTO ANOTHER DB
1. We need to carefully select the other DB: scalability.
2. Prepare an ad-hoc data migration script.
3. It's a waste of time: needs duplicated hardware, is single threaded, and is unreliable.
Note: it will still flush the cache on the Cassandra nodes.

CLONE THE CLUSTER
We immediately discard this choice. Why?
1. Worst possible network load
2. Manual import each time
3. No incremental updates
4. Needs duplicate hardware

WHAT OPTIONS DO WE HAVE?
- Run Hive queries on top of C*
- Write ETL scripts and load into another DB
- Clone the cluster

AND NOW WHAT CAN WE DO?
"We can't solve problems by using the same kind of thinking we used when we created them." Albert Einstein

CHALLENGE ACCEPTED

SPARK

SPARK
1. Alternative to MapReduce
2. A low-latency cluster computing system
3. Designed for very large datasets
4. Created by UC Berkeley AMPLab in 2010
5. Up to 100 times faster than MapReduce for:
   - Iterative algorithms
   - Interactive data mining

SPARK IS UNIQUE IN UNIFYING ALL IN A SINGLE FRAMEWORK
1. Batch, interactive, and streaming computation models.
2. Machine learning, algorithms, and graph-parallel programming models.
Components shown: Spark, Spark Streaming, BlinkDB, Shark SQL, GraphX, MLlib

PERFORMANCE
[Figure: latency in seconds (0 to 100) for BI-like, intermediate, and ETL-like queries on Spark, Impala, and Hive. Source: https://amplab.cs.berkeley.edu/benchmark/]

THE MOST ACTIVE BIG DATA APACHE PROJECT
1. Past 6 months: more active devs than Hadoop MapReduce
2. Better and faster evolution: elegant and simple, a fraction of the code lines vs. the previous stack
[Figure: code size in non-test, non-example source lines for Hadoop MapReduce, Storm (Streaming), Impala (SQL), Giraph (Graph), and Spark (with GraphX, Shark, and Streaming)]

SPARK WORKING OVER CASSANDRA
"Two plus two is four? Sometimes... Sometimes it is five." G. Orwell

DIFFERENT OPTIONS

1. CASSANDRA'S HDFS ABSTRACTION LAYER
Advantages: Easily integrates with legacy systems.
Drawbacks: Very high-level, with no access to Cassandra's low-level features. Questionable performance.
INTEGRATION POINTS: HDFS OVER CASSANDRA

2. CASSANDRA'S HADOOP INTERFACE
- Thrift protocol: obsolete
- CQLPagingInputFormat: under the hood keeps using the Thrift protocol; no control over implemented behaviour

3. CASSANDRA'S STRATIO INTERFACE
- Uses the modern DataStax Java driver
- Uses only CQL3 (no Thrift)
- Full control over data locality
- No dependency on Hadoop
- No dependency on unmaintainable code
- Fine-grained control over Cassandra splits
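
To make "data locality" and "splits" more concrete, here is a minimal sketch of how a token ring could be cut into splits that each carry a preferred host. It is an assumption-laden illustration, not StratioDeep's implementation: the `ring_splits` function, the node names, and the small `ring_size` are all hypothetical. Each node's range (previous token, own token] is divided into equal sub-ranges, and each sub-range is tagged with the node that owns it, so a scheduler could run the task reading that split on the same machine.

```python
def ring_splits(node_tokens, splits_per_range=1, ring_size=2**64):
    """Return (start_exclusive, end_inclusive, preferred_node) tuples covering the ring."""
    entries = sorted((token, node) for node, token in node_tokens.items())
    splits = []
    for index, (end, node) in enumerate(entries):
        start = entries[index - 1][0]  # index 0 wraps around to the last node's token
        # Width of the owned range; a single node owns the whole ring.
        width = (end - start) % ring_size or ring_size
        for part in range(splits_per_range):
            low = (start + width * part // splits_per_range) % ring_size
            high = (start + width * (part + 1) // splits_per_range) % ring_size
            splits.append((low, high, node))
    return splits


splits = ring_splits({"node-a": 100, "node-b": 200, "node-c": 300},
                     splits_per_range=2, ring_size=400)
```

More splits per range give finer-grained parallelism at the cost of more, smaller tasks.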

USER INTERFACE
Stratio Deep provides two user interfaces:
- Domain object based
- Generic cell API

DOMAIN OBJECT BASED
1. The API lets you map your Cassandra tables to object entities (POJOs), just as if you were using any other ORM.
2. This abstraction is quite handy: it lets you work on an RDD of your entity objects (under the hood, StratioDeep maps columns to entity properties).
INTEGRATION POINTS: CASSANDRA'S HADOOP INTERFACE CQL3

GENERIC CELL API
1. Lets you work on an RDD<Cells>, where Cells is a collection of Cell objects. Column metadata is automatically fetched from the datastore.
2. This interface is a little more cumbersome to work with, but has the advantage that it does not require defining additional entity classes.
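
The difference between the two interfaces can be sketched without any Spark or Cassandra dependency. The code below is only an analogy in Python: `Tweet`, `Cell`, `row_to_entity`, and `row_to_cells` are hypothetical names, not StratioDeep classes or methods. The entity mapping needs a class declared up front, while the cell mapping works for any row at the price of looser typing.

```python
from dataclasses import dataclass, fields


@dataclass
class Tweet:
    """Hypothetical entity class, standing in for a POJO mapped to a table."""
    tweet_id: str
    author: str
    retweets: int


@dataclass
class Cell:
    """Generic (column name, value) pair; no per-table class is needed."""
    name: str
    value: object


def row_to_entity(row, entity_class):
    # Domain-object style: copy each column into the property of the same name.
    names = [field.name for field in fields(entity_class)]
    return entity_class(**{name: row[name] for name in names})


def row_to_cells(row):
    # Generic-cell style: every column becomes a Cell, whatever the table is.
    return [Cell(name, value) for name, value in row.items()]


row = {"tweet_id": "t1", "author": "alice", "retweets": 12}
tweet = row_to_entity(row, Tweet)
cells = row_to_cells(row)
```

With the entity, a field is read as `tweet.retweets`; with cells, the caller has to search the collection for the cell named `retweets`.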

SPARK-CASSANDRA ALLOWS SELECTING JUST THE INITIAL DATA SPARK NEEDS FROM THE CASSANDRA DATA STORE
With the Spark-Cassandra integration we can leverage the power of Cassandra's main indexes, and especially its secondary indexes, to fetch only the data we need efficiently.
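
A small simulation shows why filtering at the source matters. This is a sketch under stated assumptions: `SourceTable` is a hypothetical in-memory stand-in for a data store with one indexed column, and `rows_sent` counts how many rows would cross the wire to the processing layer. Both approaches return the same rows, but the index lookup ships far fewer of them.

```python
class SourceTable:
    """In-memory stand-in for a table with a single indexed column."""

    def __init__(self, rows, indexed_column):
        self.rows = list(rows)
        self.rows_sent = 0  # rows handed over to the caller so far
        self.index = {}
        for position, row in enumerate(self.rows):
            self.index.setdefault(row[indexed_column], []).append(position)

    def scan(self):
        # Full scan: every row is sent, and the caller filters afterwards.
        for row in self.rows:
            self.rows_sent += 1
            yield row

    def lookup(self, value):
        # Index lookup: only matching rows are sent.
        for position in self.index.get(value, []):
            self.rows_sent += 1
            yield self.rows[position]


rows = [{"id": i, "country": "ES" if i % 100 == 0 else "FR"} for i in range(1000)]

scanned = SourceTable(rows, "country")
via_scan = [row for row in scanned.scan() if row["country"] == "ES"]

indexed = SourceTable(rows, "country")
via_index = list(indexed.lookup("ES"))
```

Here the scan transfers all 1000 rows to keep 10, while the lookup transfers only those 10.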

WORK IN PROGRESS
Stratio RT extension:
- Built on top of the Spark Streaming API
- Registration of new channels/consumers/producers
- Cross-module integration with StratioMeta
- Lets us create flows of data between StratioDeep and StratioRT
- Materialized views, live queries, alerts, etc.

WORK IN PROGRESS
Stratio Meta is a SQL-like language:
- Built on top of Spark's API
- A single interface to all Stratio platform modules
- Smart engine to decide the best way to resolve the queries
- Mixes stream processing and batch in one language

CONCLUSION
With Spark-Cassandra we are covering a more complete use case:
- A lot of users accessing a lot of data from applications or predefined reports.
- The needs of data analysts, who can transform, analyze, and query high volumes of data openly with a more powerful data manipulation tool in their hands.

THANKS

stratio.com @stratiobd Álvaro Agea Herradón @alvaroagea Antonio Alcocer Falcón