DataStax Enterprise 3.x

DataStax Enterprise 3.x
  Realtime Analytics with Solr
  Jason Rutherglen
  2012 DataStax

About the Presenter
  Big Data Engineer at DataStax
  Co-author of "Programming Hive" and "Lucene and Solr: The Definitive Guide" from O'Reilly

About DataStax
  The company behind Cassandra
  Sells DataStax Enterprise

DataStax Enterprise 3.x

DataStax Enterprise
  Single stack
  Cassandra
  Solr
  Hadoop
  Consulting
  Support

Cassandra at Netflix
  ZDNet article: "The biggest cloud app of all: Netflix" http://zd.net/zhtrmw
  Built on Cassandra
  According to Cockcroft, if something goes wrong, Netflix can continue to run the entire service on two out of three zones

What is Big Data?
  Petabytes of growing data
  Hadoop is for batch work
  What are the solutions for realtime?

What is realtime?
  Near realtime
  1000 millisecond latency

Why not relational databases?
  Cost of scaling to petabytes
  Physical limitations

Relational to Big Data
  Hadoop for batch
  Solr and Cassandra for realtime
  Gives most of relational capability at 1/10 the cost, scales linearly

Why Cassandra?
  Distributed database heavy lifting
  Simple Dynamo model
  Executes replication tasks extremely well

Cassandra vs. HBase
  Cassandra is easier, code is readable
  Fewer moving parts
  Multi-datacenter replication
  Enables low-level IO tuning

Cassandra vs. HBase
  HBase runs on HDFS
  HDFS is not designed for random access IO
  Multiple hacks / products to perform random access (MapR, HDFS Jiras)

Cassandra vs. HBase
  Cassandra is peer to peer, there is no single point of failure (SPOF)
  The HDFS name node is a single point of failure

HBase at Facebook
  Most of Facebook runs on MySQL
  Memcache front-ends the reads

Batch Analytics
  Hive with Hadoop
  A vague dialect of SQL
  Requires Java for UDFs
  Relational joins

Realtime Analytics
  Solr
  SQL features except relational joins
  Use Hive for relational joins
  CEP (Complex Event Processing)
  Storm

CEP (Complex Event Processing)
  Storm, computes results on streaming data

Lucene
  Java inverted indexing library
  Text analytics is raw computation over linear sets of data
  High-speed computation engine

Inverted Indexes
  Terms dictionary points to lists of document ids (integers)
  Tokenizes text
  Complete variety of computation on vectors of data
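To make the terms-dictionary idea concrete, here is a minimal Python sketch of an inverted index. It is not Lucene's implementation: the whitespace tokenizer and the function names (`tokenize`, `build_inverted_index`, `search_all`) are assumptions made for illustration only.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; real analyzers do much more."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to a sorted list of integer document ids."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search_all(index, terms):
    """Return ids of documents that contain every term (an AND query)."""
    id_sets = [set(index.get(term, [])) for term in terms]
    return sorted(set.intersection(*id_sets)) if id_sets else []


docs = ["when in Rome", "Rome was not built in a day", "a day in the life"]
index = build_inverted_index(docs)
print(index["rome"])                     # document ids whose text contains "rome"
print(search_all(index, ["in", "day"]))  # documents containing both terms
```

Because every posting list is just a list of integers, queries reduce to set operations over those lists, which is the "computation on vectors of data" the slide refers to.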

Solr
  Search server built around Lucene
  Adds faceting, distributed search
  Missed the cloud environment features of NoSQL systems for many years

Solr Cloud
  Solr Cloud is a ZooKeeper-based system
  New and probably not production ready
  Playing catch-up

Elastic Search
  High overlap with Solr
  More mature than Solr Cloud
  Fewer distributed features than Cassandra

Cassandra Concepts
  Columns, column families, keyspaces
  Peer to peer
  Eventual consistency
  Implements basic Google BigTable model

Lucene and Cassandra
  Both implement a log-structured merge tree file architecture
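The log-structured merge idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how Lucene segments or Cassandra SSTables are actually built: the class name `TinyLSM`, the in-memory "segments", and the `memtable_limit` parameter are all hypothetical.

```python
class TinyLSM:
    """Toy log-structured merge store: writes go to memory, then to immutable segments."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}          # mutable in-memory buffer for recent writes
        self.segments = []          # immutable sorted segments, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Freeze the memtable into a new sorted, immutable segment."""
        if self.memtable:
            self.segments.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        """Check the memtable first, then segments from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):
            for k, v in segment:
                if k == key:
                    return v
        return None

    def compact(self):
        """Merge all segments into one; newer values overwrite older ones."""
        merged = {}
        for segment in self.segments:  # oldest first
            merged.update(segment)
        self.segments = [tuple(sorted(merged.items()))] if merged else []


db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)   # second write fills the memtable and triggers a flush
db.put("a", 3)   # newer value for "a" shadows the flushed one
print(db.get("a"), db.get("b"), len(db.segments))
```

The point of the design is that existing segments are never modified in place: new data is appended as new segments, and a background merge (compaction in Cassandra, segment merging in Lucene) reclaims space later.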

DataStax Enterprise with Solr
  Data is stored in Cassandra
  Data placement controlled by Cassandra
  Solr is a secondary index (only)

DataStax Enterprise with Solr
  Separation of church and state, i.e., data and index

Indexing
  Indexing is a CPU intensive task
  Not IO bound because of multithreading
  When a thread is flushing, other threads are indexing, CPU is saturated at all times

Queries
  IO bound until the index fits in RAM, then CPU bound
  Lucene enables multithreaded queries
  Solr does not multithread queries

DataStax Enterprise with Solr
  Eventual consistency, each node has its own Lucene index
  Lucene segment files are not replicated (like Solr Cloud and ElasticSearch)

Distributed Search Architecture
  Query requests are round-robined across nodes automatically
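DSE does this routing for you, so no client code is needed. Purely to illustrate what round-robin distribution means, here is a small Python sketch; the class name `RoundRobinRouter` and the node names are hypothetical and do not correspond to any DSE API.

```python
from itertools import cycle


class RoundRobinRouter:
    """Hand out nodes in a fixed rotating order, one per request."""

    def __init__(self, nodes):
        if not nodes:
            raise ValueError("at least one node is required")
        self._nodes = cycle(list(nodes))

    def next_node(self):
        """Return the node that should serve the next query."""
        return next(self._nodes)


router = RoundRobinRouter(["node1", "node2", "node3"])  # hypothetical node names
for _ in range(4):
    print(router.next_node())  # visits each node in turn, then wraps around
```

Each node receives an equal share of requests regardless of which node the client happened to contact first.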

DSE 3.0.1
  3.0.1 is the current release of DataStax Enterprise

New Features in DSE 3.0.1
  Ease of re-indexing
  Re-index the entire cluster or per node
  Re-indexing occurs when the Solr schema changes

New Features in DSE 3.0.1
  Solr Cloud requires re-indexing from an external data source such as a relational database

New Features in DSE 3.0.1
  DSE re-indexes directly from Cassandra
  No custom code is required for re-indexing

New Features in DSE 3.0.1
  View the heap memory usage of the field caches
  Perform capacity planning

New Features in DSE 3.0.1
  Multithreaded re-indexing and repair
  Adding a new Solr node is fast

New Features in DSE 3.0.1
  Kerberos and SSL security
  Security audit logging

DataStax Enterprise 3.1
  Near realtime: per-segment filters, facets, multivalue facets
  Solr 4.3

DataStax Enterprise 3.1
  vnodes
  Composite keys

Future
  Multi-datacenter live Solr schema updates and re-indexing
  CQL -> Solr queries, makes porting SQL applications easy for SQL developers

Demo of Wikipedia

Real World Example: Tick Data
  Details about every trade
  Tick data is generated in real time and is quantitatively queryable
  Too big to query in real time? Not anymore!

Tick Data - Moving Average
  Computing the moving stock price average in real time
  Comparing multiple moving averages for different stock_symbols
  Requires statistical analysis, group by companies, and faceting features
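The slides compute these averages inside Solr using its statistics and faceting features. As a plain illustration of the calculation itself, here is a client-side Python sketch of a per-symbol moving average; the class name `MovingAverage`, the window size, and the sample symbols and prices are all made up for the example.

```python
from collections import defaultdict, deque


class MovingAverage:
    """Simple moving average over the last `window` prices, tracked per symbol."""

    def __init__(self, window):
        if window < 1:
            raise ValueError("window must be at least 1")
        # deque(maxlen=...) silently drops the oldest price once the window is full
        self._prices = defaultdict(lambda: deque(maxlen=window))

    def add_tick(self, symbol, price):
        """Record a new price and return the updated average for that symbol."""
        prices = self._prices[symbol]
        prices.append(price)
        return sum(prices) / len(prices)


averages = MovingAverage(window=3)
for symbol, price in [("ABC", 10.0), ("ABC", 20.0), ("XYZ", 5.0), ("ABC", 30.0), ("ABC", 40.0)]:
    print(symbol, averages.add_tick(symbol, price))
```

Keeping one window per symbol is what makes it possible to compare moving averages across companies, which corresponds to the "group by" requirement on the slide.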

Tick Data Analytics - Ad Hoc Searches
  Read latest ticks for a given company
  Query ticks for companies in specific verticals during large events such as press releases
  Compute deviation of stock data over 5 years for groups of companies

Real Time Stocks Demo

General

Schema
  Like an SQL CREATE TABLE statement
  Defines field types
  Defines fields

Solr Config
  XML-based configuration options for Solr

Soft Commit
  Commits new index segment to RAM
  Avoids hard commit fsync

Auto Soft Commit
  <!-- The default high-performance update handler -->
  <updateHandler class="solr.DirectUpdateHandler2">
    <autoSoftCommit>
      <maxTime>1000</maxTime> <!-- Near realtime of 1 second -->
    </autoSoftCommit>
  </updateHandler>

Field Cache
  Loaded for sort and facet queries
  Uses heap space

SolrJ / HTTP
  Java-based API for interacting with a Solr server
  DSE supports SolrJ/HTTP with no changes

Insert data with CQL
  Auto data type mapping
  Copy fields
  Dynamic fields

CQL with Solr Query
  Exists, but is mainly useful for debugging
  Limited functionality, queries a single node

CQL Insert Example
  INSERT INTO wikipedia (key, text) VALUES ('1', 'when in rome')

How to convert applications

SQL to Solr
  Common to convert existing SQL applications to Big Data
  Focus on the application functionality

SQL to Solr
  Cassandra makes all distributed operations easy

SELECT WHERE
  SELECT * FROM wikipedia WHERE type = 'pdf'
  q=type:pdf

SELECT columns
  SELECT title,text FROM wikipedia
  q=*:* fl=title,text

SELECT COUNT
  SELECT COUNT(*) FROM wikipedia WHERE type = 'pdf'
  q=type:pdf
  Get the num found

SELECT ORDER BY
  SELECT * FROM stocks ORDER BY price ASC
  q=*:* sort=price asc

SELECT AVG
  SELECT AVG(price) FROM stocks
  q=*:* stats=true stats.field=price
  The average is called "mean" in the Solr results

SELECT AVG GROUP BY
  SELECT AVG(price) FROM stocks GROUP BY symbol
  q=*:* stats=true stats.field=price stats.facet=symbol
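The count, average, and grouped average from the last few slides all have to be read out of the Solr response. The Python sketch below shows where those values might be found in a JSON response. The sample is hand-written and heavily trimmed, not real server output, and the field layout reflects my understanding of Solr 4.x JSON responses, so it should be checked against an actual server. The helper names are hypothetical.

```python
import json

# Hand-written, trimmed sample (an assumption, not captured from a real server).
SAMPLE_RESPONSE = json.loads("""
{
  "response": {"numFound": 3, "start": 0, "docs": []},
  "stats": {
    "stats_fields": {
      "price": {
        "mean": 20.0,
        "facets": {
          "symbol": {
            "ABC": {"mean": 15.0},
            "XYZ": {"mean": 30.0}
          }
        }
      }
    }
  }
}
""")


def num_found(resp):
    """SELECT COUNT(*): the number of matching documents."""
    return resp["response"]["numFound"]


def mean_of(resp, field):
    """SELECT AVG(field): Solr reports the average as 'mean'."""
    return resp["stats"]["stats_fields"][field]["mean"]


def mean_by(resp, field, facet):
    """SELECT AVG(field) ... GROUP BY facet: one mean per facet value."""
    groups = resp["stats"]["stats_fields"][field]["facets"][facet]
    return {value: stats["mean"] for value, stats in groups.items()}


print(num_found(SAMPLE_RESPONSE))
print(mean_of(SAMPLE_RESPONSE, "price"))
print(mean_by(SAMPLE_RESPONSE, "price", "symbol"))
```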

SELECT WHERE LIKE
  SELECT * FROM wikipedia WHERE text LIKE 'rom%'
  q=text:rom*
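The SQL-to-Solr mappings above can be collected into one small helper that builds the query string. This is a sketch using only Python's standard library; the function name `solr_query` and the base URL are assumptions for illustration, while the parameter names (`q`, `fl`, `sort`, `stats`, `stats.field`, `stats.facet`) are the ones shown on the slides.

```python
from urllib.parse import urlencode, parse_qs


def solr_query(q="*:*", fields=None, sort=None, stats_field=None, stats_facet=None):
    """Build a URL-encoded Solr query string from SQL-like pieces."""
    params = {"q": q}                          # WHERE clause; *:* matches everything
    if fields:
        params["fl"] = ",".join(fields)        # SELECT column list
    if sort:
        params["sort"] = sort                  # ORDER BY, e.g. "price asc"
    if stats_field:
        params["stats"] = "true"               # AVG and other aggregates
        params["stats.field"] = stats_field
        if stats_facet:
            params["stats.facet"] = stats_facet  # GROUP BY
    return urlencode(params)


where_pdf = solr_query(q="type:pdf")
columns = solr_query(fields=["title", "text"])
order_by = solr_query(sort="price asc")
avg_by_symbol = solr_query(stats_field="price", stats_facet="symbol")
like_prefix = solr_query(q="text:rom*")

# Hypothetical endpoint; the real host, port, and core name depend on the deployment.
BASE_URL = "http://localhost:8983/solr/wiki.wikipedia/select"
print(BASE_URL + "?" + where_pdf)
print(parse_qs(avg_by_symbol))
```

Using `urlencode` rather than string concatenation matters here because characters such as `:`, `*`, and spaces in the Solr syntax need escaping in a URL.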
