DataStax Enterprise 3.x

1 DataStax Enterprise 3.x
Realtime Analytics with Solr
Jason Rutherglen
2012 DataStax

2 About the Presenter
- Big Data Engineer at DataStax
- Co-author of Programming Hive and Lucene and Solr: The Definitive Guide from O'Reilly

3 About DataStax
- The company behind Cassandra
- Sells DataStax Enterprise

4 DataStax Enterprise 3.x

5 DataStax Enterprise
- Single stack
- Cassandra
- Solr
- Hadoop
- Consulting
- Support

6 Cassandra at Netflix
- ZDNet article: The biggest cloud app of all: Netflix
- Built on Cassandra
- According to Cockcroft, if something goes wrong, Netflix can continue to run the entire service on two out of three zones

7 What is Big Data?
- Petabytes of growing data
- Hadoop is for batch work
- What are the solutions for realtime?

8 What is realtime?
- Near realtime
- 1000 millisecond latency

9 Why not relational databases?
- Cost of scaling to petabytes
- Physical limitations

10 Relational to Big Data
- Hadoop for batch
- Solr and Cassandra for realtime
- Gives most of relational capability at 1/10 the cost, scales linearly

11 Why Cassandra?
- Distributed database heavy lifting
- Simple Dynamo model
- Executes replication tasks extremely well

12 Cassandra vs. HBase
- Cassandra is easier, code is readable
- Fewer moving parts
- Multi-datacenter replication
- Enables low level IO tuning

13 Cassandra vs. HBase
- HBase runs on HDFS
- HDFS is not designed for random access IO
- Multiple hacks / products to perform random access (MapR, HDFS Jiras)

14 Cassandra vs. HBase
- Cassandra is peer to peer, there is no single point of failure (SPOF)
- The HDFS name node is a single point of failure

15 HBase at Facebook
- Most of Facebook runs on MySQL
- Memcache front ends the reads

16 Batch Analytics
- Hive with Hadoop
- A vague dialect of SQL
- Requires Java for UDFs
- Relational joins

17 Realtime Analytics
- Solr
- SQL features except relational joins
- Use Hive for relational joins
- CEP (Complex Event Processing)
- Storm

18 CEP (Complex Event Processing)
- Storm, computes results on streaming data

19 Lucene
- Java inverted indexing library
- Text analytics is raw computation over linear sets of data
- High speed computation engine

20 Inverted Indexes
- Terms dictionary points to a list of document IDs (integers)
- Tokenizes text
- Complete variety of computation on vectors of data
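
The terms-dictionary idea on the slide above can be sketched in a few lines of Python. This is only an illustration of the data structure, not Lucene's implementation: the whitespace tokenizer, the function names, and the three sample documents are all assumptions made for the example.

```python
from collections import defaultdict


def tokenize(text):
    # Simplistic tokenizer (assumption): lowercase, split on whitespace.
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to a sorted list of integer document ids."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search_all(index, query):
    """Return ids of documents that contain every term in the query."""
    postings = [set(index.get(term, ())) for term in tokenize(query)]
    return sorted(set.intersection(*postings)) if postings else []


# Hypothetical sample documents; ids are their positions in the list.
docs = ["when in rome", "rome was not built in a day", "a day in the life"]
index = build_inverted_index(docs)
print(index["rome"])
print(search_all(index, "a day"))
```

A query becomes set arithmetic over lists of integers, which is why the slides describe Lucene as a computation engine over vectors of data.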

21 Solr
- Search server built around Lucene
- Adds faceting, distributed search
- Missed the cloud environment features of NoSQL systems for many years

22 Solr Cloud
- Solr Cloud is a Zookeeper based system
- New and probably not production ready
- Playing catch up

23 Elastic Search
- High overlap with Solr
- More mature than Solr Cloud
- Fewer distributed features than Cassandra

24 Cassandra Concepts
- Columns, column families, keyspaces
- Peer to peer
- Eventual consistency
- Implements basic Google BigTable model

25 Lucene and Cassandra
- Both implement a log structured merge tree file architecture
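
A minimal sketch of the log structured merge idea, assuming a toy in-memory store: writes go to a mutable buffer, full buffers are flushed as immutable sorted segments, and a merge step combines segments so that newer values win. The class name `TinyLSM` and its methods are hypothetical; neither Cassandra's SSTables nor Lucene's segments are implemented this way in detail.

```python
class TinyLSM:
    """Toy log structured merge store (illustrative only)."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}          # mutable in-memory buffer
        self.segments = []          # immutable sorted segments, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the buffer out as a sorted, never-modified segment.
        if self.memtable:
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        # Newest data wins: check the buffer, then segments newest first.
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):
            for k, v in segment:
                if k == key:
                    return v
        return None

    def compact(self):
        # Merge all segments into one; later segments overwrite earlier ones.
        merged = {}
        for segment in self.segments:
            merged.update(segment)
        self.segments = [sorted(merged.items())] if merged else []


store = TinyLSM(memtable_limit=2)
store.put("a", 1)
store.put("b", 2)   # buffer full, flushed as segment 1
store.put("a", 3)
store.put("c", 4)   # buffer full, flushed as segment 2
print(len(store.segments), store.get("a"))
store.compact()
print(store.segments)
```

Because segments are never modified after they are written, updates are sequential writes, and stale values are only discarded during the merge.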

26 DataStax Enterprise with Solr
- Data is stored in Cassandra
- Data placement controlled by Cassandra
- Solr is a secondary index (only)

27 DataStax Enterprise with Solr
- Separation of church and state, e.g., data and index

28 Indexing
- Indexing is a CPU intensive task
- Not IO bound because of multithreading
- When a thread is flushing, other threads are indexing, CPU is saturated at all times

29 Queries
- IO bound, index needs to fit in RAM, then CPU bound
- Lucene enables multithreading queries
- Solr does not multithread queries

30 DataStax Enterprise with Solr
- Eventual consistency, each node has its own Lucene index
- Lucene segment files are not replicated (like Solr Cloud and ElasticSearch)

31 Distributed Search Architecture
- Query requests are round-robined across nodes automatically
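
Round-robin distribution simply means each new request goes to the next node in a fixed rotation. The sketch below shows only that policy; the slide says DSE does this automatically, so no such client code is needed in practice. The class name and node names are hypothetical.

```python
from itertools import cycle


class RoundRobinRouter:
    """Hands out nodes in a fixed rotation (illustrative only)."""

    def __init__(self, nodes):
        self._nodes = cycle(nodes)

    def next_node(self):
        return next(self._nodes)


# Hypothetical node names.
router = RoundRobinRouter(["node1", "node2", "node3"])
picked = [router.next_node() for _ in range(5)]
print(picked)
```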

32 DSE is the current release of DataStax Enterprise

33 New Features in DSE
- Ease of re-indexing
- Re-index the entire cluster or per node
- Re-indexing occurs when the Solr schema changes

34 New Features in DSE
- Solr Cloud requires re-indexing from an external data source such as a relational database

35 New Features in DSE
- DSE re-indexes directly from Cassandra
- No custom code is required for re-indexing

36 New Features in DSE
- View the heap memory usage of the field caches
- Perform capacity planning

37 New Features in DSE
- Multithreaded re-indexing and repair
- Adding a new Solr node is fast

38 New Features in DSE
- Kerberos and SSL security
- Security audit logging

39 DataStax Enterprise 3.1
- Near realtime: per-segment filters, facets, multivalue facets
- Solr

40 DataStax Enterprise 3.1
- vnodes
- Composite keys

41 Future
- Multi datacenter live Solr schema updates and re-indexing
- CQL -> Solr queries, makes porting SQL applications easy for SQL developers

42 Demo of Wikipedia

43 Real World Example: Tick Data
- Details about every trade
- Tick data is generated in real time and is quantitatively queryable
- Too big to query in real time? Not anymore!

44 Tick Data - Moving Average
- Computing the moving stock price average in real time
- Comparing multiple moving averages for different stock symbols
- Requires statistical analysis, group by companies, and faceting features
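
For reference, the arithmetic behind a per-symbol moving average looks like the sketch below. It is a client-side illustration only; the slides compute such statistics with Solr queries, not with this code. The class name, the window size, and the ticker symbols are assumptions.

```python
from collections import defaultdict, deque


class MovingAverage:
    """Simple moving average over the last `window` prices per symbol."""

    def __init__(self, window):
        self.window = window
        # One bounded queue per symbol; old prices fall off automatically.
        self.prices = defaultdict(lambda: deque(maxlen=window))

    def add_tick(self, symbol, price):
        recent = self.prices[symbol]
        recent.append(price)
        return sum(recent) / len(recent)


# Hypothetical symbols and prices.
ma = MovingAverage(window=3)
for price in (10.0, 11.0, 12.0, 13.0):
    latest = ma.add_tick("ABC", price)
other = ma.add_tick("XYZ", 5.0)
print(latest, other)
```

Keeping one bounded queue per symbol is what makes it possible to compare moving averages across several symbols at once.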

45 Tick Data Analytics - Ad Hoc Searches
- Read latest ticks for a given company
- Query ticks for companies in specific verticals during large events such as press releases
- Compute deviation of stock data over 5 years for groups of companies

46 Real Time Stocks Demo

47 General

48 Schema
- Like an SQL CREATE TABLE statement
- Defines field types
- Defines fields

49 Solr Config
- XML based configuration options for Solr

50 Soft Commit
- Commits new index segment to RAM
- Avoids hard commit fsync

51 Auto Soft Commit
  <updateHandler class="solr.DirectUpdateHandler2">
    <autoSoftCommit>
      <maxTime>1000</maxTime> <!-- Near realtime of 1 second -->
    </autoSoftCommit>
  </updateHandler>

52 Field Cache
- Loaded for sort and facet queries
- Uses heap space

53 SolrJ / HTTP
- Java based API for interacting with a Solr server
- DSE supports SolrJ/HTTP with no changes

54 Insert data with CQL
- Auto data type mapping
- Copy fields
- Dynamic fields

55 CQL with Solr Query
- Exists, but is mainly useful for debugging
- Limited functionality, queries a single node

56 CQL Insert Example
  INSERT INTO wikipedia (key, text) VALUES ('1', 'when in rome')

57 How to convert applications

58 SQL to Solr
- Common to convert existing SQL applications to Big Data
- Focus on the application functionality

59 SQL to Solr
- Cassandra makes all distributed operations easy

60 SELECT WHERE
  SQL:  SELECT * FROM wikipedia WHERE type = 'pdf'
  Solr: q=type:pdf

61 SELECT columns
  SQL:  SELECT title,text FROM wikipedia
  Solr: q=*:*&fl=title,text

62 SELECT COUNT
  SQL:  SELECT COUNT(*) FROM wikipedia WHERE type = 'pdf'
  Solr: q=type:pdf
- Read the count from numFound in the response
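
A sketch of how the COUNT mapping might look from a client, using only the Python standard library. The base URL, the helper names, and the sample response are assumptions for illustration; `rows=0` and `wt=json` are standard Solr parameters added here so that only the count comes back, in JSON.

```python
from urllib.parse import urlencode


def select_url(base_url, **params):
    """Build a Solr /select URL with properly encoded parameters."""
    return base_url + "/select?" + urlencode(params)


def num_found(response):
    """Read the hit count from a parsed Solr JSON response."""
    return response["response"]["numFound"]


# Hypothetical host and core name.
url = select_url("http://localhost:8983/solr/wikipedia",
                 q="type:pdf", rows=0, wt="json")
print(url)

# Hand-written stand-in for a parsed response (values are made up).
sample = {"responseHeader": {"status": 0},
          "response": {"numFound": 42, "start": 0, "docs": []}}
print(num_found(sample))
```

Fetching the URL and parsing the body (for example with `urllib.request` and `json`) is left out so the sketch runs without a Solr server.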

63 SELECT ORDER BY
  SQL:  SELECT * FROM stocks ORDER BY price ASC
  Solr: q=*:*&sort=price asc

64 SELECT AVG
  SQL:  SELECT AVG(price) FROM stocks
  Solr: q=*:*&stats=true&stats.field=price
- The average is called "mean" in the Solr results

65 SELECT AVG GROUP BY
  SQL:  SELECT AVG(price) FROM stocks GROUP BY symbol
  Solr: q=*:*&stats=true&stats.field=price&stats.facet=symbol
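
The AVG ... GROUP BY mapping can be sketched as a pair of helpers: one builds the stats parameters, the other pulls the per-group means out of a parsed response. The helper names are hypothetical, and the response layout (per-group statistics nested under `facets` inside the field's stats) reflects my understanding of the StatsComponent JSON output, so verify it against your Solr version. The sample values are made up.

```python
from urllib.parse import urlencode


def avg_params(field, group_by=None):
    """Solr parameters equivalent to SELECT AVG(field) [GROUP BY group_by]."""
    params = [("q", "*:*"), ("rows", "0"),
              ("stats", "true"), ("stats.field", field)]
    if group_by:
        params.append(("stats.facet", group_by))
    return params


def means_by_group(response, field, group_by):
    """Extract {group value: mean} from a parsed stats response."""
    facets = response["stats"]["stats_fields"][field]["facets"][group_by]
    return {value: stats["mean"] for value, stats in facets.items()}


query_string = urlencode(avg_params("price", "symbol"))
print(query_string)

# Hand-written stand-in for a parsed response (assumed layout, made-up values).
sample = {"stats": {"stats_fields": {"price": {
    "mean": 15.0,
    "facets": {"symbol": {"ABC": {"mean": 10.0}, "XYZ": {"mean": 20.0}}},
}}}}
print(means_by_group(sample, "price", "symbol"))
```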

66 SELECT WHERE LIKE
  SQL:  SELECT * FROM wikipedia WHERE text LIKE 'rom%'
  Solr: q=text:rom*
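
When porting many LIKE clauses, the wildcard translation can be mechanical. A minimal sketch, assuming simple patterns only: `%` becomes `*` and `_` becomes `?` in Lucene query syntax. The function name is hypothetical, and the sketch does not escape Lucene special characters or handle SQL escape sequences.

```python
def like_to_solr(field, pattern):
    """Translate a simple SQL LIKE pattern into a Solr wildcard query."""
    # % matches any run of characters, _ matches exactly one.
    translated = pattern.replace("%", "*").replace("_", "?")
    return f"{field}:{translated}"


print(like_to_solr("text", "rom%"))
```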
