Brisk: More Powerful Hadoop, Powered by Cassandra
Jonathan Ellis (jbellis@datastax.com)
The evolution of Analytics
  Stage 1: Analytics + Realtime in a single system
  Stage 2: replication between separate Analytics and Realtime clusters
  Stage 3: ETL from the realtime system into the analytics system
Brisk re-unifies realtime and analytics
The Traditional Hadoop Stack
  Master Nodes: NameNode, Secondary NameNode, JobTracker, HBase Master, ZooKeeper, MetaStore
  Slave Nodes: DataNode, TaskTracker, Region Server
  Client Nodes: Pig, Hive
Brisk Architecture
Brisk Highlights
  - Easy to deploy and operate
  - No single points of failure
  - Scale and change nodes with no downtime
  - Cross-DC, multi-master clusters
  - Allocate resources for OLAP vs. OLTP, with no ETL
Cassandra data model
  ColumnFamilies contain rows + columns (not really schemaless for a while now)

  row      password  name              site
  zznate   *         Nate McCall
  driftx   *         Brandon Williams
  jbellis  *         Jonathan Ellis    datastax.com
Sparse
  Each row stores only the columns it actually has:
  zznate:  password=*, name=Nate McCall
  driftx:  password=*, name=Brandon Williams
  jbellis: password=*, name=Jonathan Ellis, site=datastax.com
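The sparse layout can be sketched in plain Python: each row is just a dict of the columns it has, and a missing column is simply absent (the `users` name and `get_column` helper are illustrative, not Cassandra API):

```python
# Sketch of a sparse ColumnFamily: each row stores only its own columns.
users = {
    "zznate":  {"password": "*", "name": "Nate McCall"},
    "driftx":  {"password": "*", "name": "Brandon Williams"},
    "jbellis": {"password": "*", "name": "Jonathan Ellis", "site": "datastax.com"},
}

def get_column(cf, row_key, column_name):
    """Return a column value, or None if the row lacks that column."""
    return cf.get(row_key, {}).get(column_name)

print(get_column(users, "jbellis", "site"))   # datastax.com
print(get_column(users, "zznate", "site"))    # None -- no such column, no wasted storage
```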
Rows as containers / materialized views
  circle1: driftx, thobbs, pcmanus, jbellis, zznate
  circle2: xedin, mdennis
  circle3: xedin, pcmanus, ymorishita
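One way to read this slide: each row is a precomputed answer to "who is in this circle?", so membership is a single-row lookup rather than a scan or join. A minimal sketch (the `circles` dict stands in for a column family):

```python
# Each row is a materialized view: circle -> full member list,
# so "who is in circle1?" is one row read, not a scan or join.
circles = {
    "circle1": ["driftx", "thobbs", "pcmanus", "jbellis", "zznate"],
    "circle2": ["xedin", "mdennis"],
    "circle3": ["xedin", "pcmanus", "ymorishita"],
}

members = circles["circle1"]   # one lookup returns the whole view
print(members)
```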
CassandraFS
  - data stored as ByteBuffer internally -- an excellent fit for blocks
  - local reads mmap data directly (no RPC)
  - blocks are compressed with Google Snappy
  - $ hadoop distcp hdfs:///mydata cfs:///mydata
Hive support
  - Hive MetaStore in Cassandra
  - Unified schema view from any node, with no external systems and no SPOF
  - Automatically maps Cassandra column families to Hive tables
  - Supports static and dynamic column families (and supercolumns)
Hive: CFS and ColumnFamilies

  CREATE TABLE users (name STRING, zip INT);
  LOAD DATA LOCAL INPATH 'kv2.txt' OVERWRITE INTO TABLE users;

  -- static column family
  CREATE EXTERNAL TABLE Keyspace1.Users (name STRING, zip INT)
    STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler';

  -- dynamic column family
  CREATE EXTERNAL TABLE Keyspace1.Users (row_key STRING, column_name STRING, value STRING)
    STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler';
Pig Support

  With standard Cassandra:
    $ export PIG_HOME=/path/to/pig
    $ export PIG_INITIAL_ADDRESS=localhost
    $ export PIG_RPC_PORT=9160
    $ export PIG_PARTITIONER=org.apache.cassandra.dht.RandomPartitioner
    $ contrib/pig/bin/pig_cassandra
    grunt>

  With Brisk:
    $ bin/brisk pig
    grunt>
Pig: CFS and ColumnFamilies

  grunt> data = LOAD 'cfs:///example.txt' USING PigStorage()
                AS (name:chararray, value:long);

  grunt> data = LOAD 'cassandra://demo1/scores' USING CassandraStorage()
                AS (key, columns: {T: tuple(name, value)});

  grunt> data = LOAD 'cassandra://demo1/scores&slice_start=m&slice_end=s' USING CassandraStorage()
                AS (key, columns: {T: tuple(name, value)});
Data model: Realtime

  LiveStocks (last price per ticker):
    GOOG  $95.52
    AAPL  $186.10
    AMZN  $112.98

  Portfolios (shares held per ticker):
    Portfolio1: GOOG=80, LNKD=20, P=40, AMZN=100, AAPL=20

  StockHist (closing price per date):
    GOOG: 2011-01-01=$79.85, 2011-01-02=$75.23, 2011-01-03=$82.11
Data model: Analytics

  HistLoss:
    portfolio   worst_date  loss
    Portfolio1  2011-07-23  -$34.81
    Portfolio2  2011-03-11  -$11432.24
    Portfolio3  2011-05-21  -$1476.93
Data model: Analytics

  10dayreturns:
    ticker  rdate       return
    GOOG    2011-07-25  $8.23
    GOOG    2011-07-24  $6.14
    GOOG    2011-07-23  $7.78
    AAPL    2011-07-25  $15.32
    AAPL    2011-07-24  $12.68

  INSERT OVERWRITE TABLE 10dayreturns
  SELECT a.row_key ticker,
         b.column_name rdate,
         b.value - a.value
  FROM StockHist a JOIN StockHist b
    ON (a.row_key = b.row_key
        AND date_add(a.column_name, 10) = b.column_name);
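The self-join pairs each price with the price 10 days later and takes the difference. The same computation, sketched in Python over a toy StockHist dict (the 2011-07-15 price is invented so the example reproduces the slide's GOOG 2011-07-25 return of $8.23):

```python
from datetime import date, timedelta

# Toy StockHist: row_key -> {column_name (a date) -> closing price}
stock_hist = {
    "GOOG": {
        date(2011, 7, 15): 87.29,   # assumed price 10 days earlier
        date(2011, 7, 25): 95.52,
    },
}

def ten_day_returns(hist, days=10):
    """Mimic the Hive self-join: pair each price with the price `days` later."""
    rows = []
    for ticker, prices in hist.items():
        for d, px in prices.items():
            later = prices.get(d + timedelta(days=days))
            if later is not None:   # the join condition: date_add(a, 10) = b
                rows.append((ticker, d + timedelta(days=days), later - px))
    return rows

print(ten_day_returns(stock_hist))
```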
  How Hive sees the StockHist wide row, as (row_key, column_name, value):

    GOOG: 2011-01-01=$79.85, 2011-01-02=$75.23, 2011-01-03=$82.11

    row_key  column_name  value
    GOOG     2011-01-01   $79.85
    GOOG     2011-01-02   $75.23
    GOOG     2011-01-03   $82.11
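The transposition Hive performs on a dynamic column family, turning one wide Cassandra row into one (row_key, column_name, value) row per column, can be sketched as (the `to_hive_rows` helper is illustrative):

```python
# Transpose one wide Cassandra row into Hive-style (row_key, column_name, value) rows.
def to_hive_rows(row_key, columns):
    return [(row_key, name, value) for name, value in sorted(columns.items())]

goog = {"2011-01-01": "$79.85", "2011-01-02": "$75.23", "2011-01-03": "$82.11"}
for row in to_hive_rows("GOOG", goog):
    print(row)
```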
Data model: Analytics

  portfolio_returns:
    portfolio   rdate       preturn
    Portfolio1  2011-07-25  $118.21
    Portfolio1  2011-07-24  $60.78
    Portfolio1  2011-07-23  -$34.81
    Portfolio2  2011-07-25  $2143.92
    Portfolio3  2011-07-24  -$10.19

  INSERT OVERWRITE TABLE portfolio_returns
  SELECT a.row_key portfolio,
         b.rdate,
         SUM(b.return) preturn
  FROM portfolios a JOIN 10dayreturns b
    ON (a.column_name = b.ticker)
  GROUP BY a.row_key, b.rdate;
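The join-plus-GROUP BY reduces to: for each portfolio, sum the returns of the tickers it holds, per date. A toy sketch (input values are the slide's 10dayreturns numbers; like the slide's SUM(b.return), share counts are not weighted in):

```python
from collections import defaultdict

# portfolios: portfolio -> {ticker: shares}; returns: (ticker, rdate, return)
portfolios = {"Portfolio1": {"GOOG": 80, "AAPL": 20}}
returns = [("GOOG", "2011-07-25", 8.23), ("AAPL", "2011-07-25", 15.32)]

def portfolio_returns(portfolios, returns):
    """Mimic the Hive join + GROUP BY: sum each portfolio's ticker returns per date."""
    out = defaultdict(float)
    for name, holdings in portfolios.items():
        for ticker, rdate, ret in returns:
            if ticker in holdings:           # the join condition
                out[(name, rdate)] += ret    # SUM(b.return)
    return dict(out)

print(portfolio_returns(portfolios, returns))
```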
Data model: Analytics

  HistLoss:
    portfolio   worst_date  loss
    Portfolio1  2011-07-23  -$34.81
    Portfolio2  2011-03-11  -$11432.24
    Portfolio3  2011-05-21  -$1476.93

  INSERT OVERWRITE TABLE HistLoss
  SELECT a.portfolio, b.rdate, a.minp
  FROM (SELECT portfolio, MIN(preturn) AS minp
        FROM portfolio_returns
        GROUP BY portfolio) a
  JOIN portfolio_returns b
    ON (a.portfolio = b.portfolio AND a.minp = b.preturn);
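The final step takes the minimum return per portfolio, then joins back to recover the date it happened on. In Python the two steps collapse into one pass (input rows are the slide's Portfolio1 values):

```python
# For each portfolio, find its worst single-day return and the date it occurred.
rows = [
    ("Portfolio1", "2011-07-25", 118.21),
    ("Portfolio1", "2011-07-24", 60.78),
    ("Portfolio1", "2011-07-23", -34.81),
]

def hist_loss(rows):
    """Mimic the Hive MIN subquery + self-join: (portfolio, worst_date, loss)."""
    worst = {}
    for portfolio, rdate, preturn in rows:
        if portfolio not in worst or preturn < worst[portfolio][1]:
            worst[portfolio] = (rdate, preturn)
    return [(p, d, r) for p, (d, r) in worst.items()]

print(hist_loss(rows))   # [('Portfolio1', '2011-07-23', -34.81)]
```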
Portfolio Demo dataflow
  Portfolios + Historical Prices + Live Prices for today
    -> Intermediate Results
    -> Largest loss
    -> Web-based Portfolios view
OpsCenter
Where to get it http://www.datastax.com/brisk