Katta & Hadoop. Katta - Distributed Lucene Index in Production. Stefan Groschupf Scale Unlimited, 101tec. sg{at}101tec.com

1 Katta & Hadoop
Katta - Distributed Lucene Index in Production
Stefan Groschupf, Scale Unlimited, 101tec. sg{at}101tec.com
Photo by: belgianchocolate@flickr.com

2 Intro
- Business intelligence reports from event stream
- Existing event stream processing platform V1
- Built on top of Oracle
- Scale problems
- Expensive
- Slow
- Huge star schema
- New reports expensive to develop
- Expensive to keep old data

3 Goals
- Build next generation platform for event stream processing
- Faster report development - plugins
- Reduce total cost of ownership
- No license fees, open source based
- Commodity hardware
- Lower maintenance costs
- Better scalability
- Better performance
- Cheap storage

4 Challenge
- Integrate system into big picture
- Log data via JMS
- Report web app uses JDBC
- Report developers do not know MapReduce but SQL, XPath etc.
- Which format to store data in?
- Which format to process records in?
- Where to store processing results?

5 Challenges II
- Telescope to microscope - zoom to log record level
- One report - many MR jobs
- Job scheduling
- Enterprise 24/7 monitoring - SNMP
- Work with open source release cycles

6 Our Solution I
[Architecture diagram. Components: JMS messages, binary feed, DFS, Hadoop MR, Pig, Katta, database, web page, web console]
- Monitor and manage everything (web console)
- Distributed index for log message retrieval (Katta)
- Customer user interface
- Files organized by day
- Aggregate data and generate report data
- Convert logs to measures
- Store results of Pig queries

7 Our Solution II
[Diagram: data format at each pipeline stage]
Formats: binary tree format, XML > tuples, text tuples, SQL schema
Stages: JMS, DFS, MR, Pig, DB

8 Katta
- Serving indexes the Hadoop distributed file system way
- Index as index shards on many servers
- Replicate shards on different servers for performance and fault tolerance
- Lightweight
- Master failover
- Fast*
- Easy to integrate
- Plays well with Hadoop clusters
- Apache License, Version 2.0
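The sharding and replication idea on this slide can be sketched in a few lines. This is only an illustration, not Katta's actual assignment policy: the function name `assign_shards`, the round-robin placement, and the shard and node names are all assumptions made for the example.

```python
from collections import defaultdict


def assign_shards(shards, nodes, replication=2):
    """Place each shard on `replication` distinct nodes, round-robin.

    Hypothetical policy for illustration only; the slides describe the
    real replication policy as pluggable and do not spell it out.
    """
    if replication > len(nodes):
        raise ValueError("replication level exceeds number of nodes")
    assignment = defaultdict(list)  # node -> shards it serves
    for i, shard in enumerate(shards):
        for r in range(replication):
            # Consecutive offsets guarantee distinct nodes per shard.
            node = nodes[(i + r) % len(nodes)]
            assignment[node].append(shard)
    return dict(assignment)


if __name__ == "__main__":
    plan = assign_shards(
        ["shard-0", "shard-1", "shard-2"],
        ["node-a", "node-b", "node-c"],
    )
    for node, served in sorted(plan.items()):
        print(node, served)
```

With a replication level of 2, any single node can fail and every shard is still served by another node, which is the fault tolerance argument made on the slide.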

9 Cons
- No realtime updates like Solr, CouchDB or Cassandra yet*
- Index serving tool, not indexer
* though on roadmap

10 What is a Katta index?
[Diagram: a Katta index containing three Lucene indexes]
- Katta index: folder with Lucene indexes
- Each Lucene index is a shard
- Shard indexes can be zipped

11 Overview
[Architecture diagram]
- Hadoop cluster or single server: create index and copy to shared filesystem (HDFS, NAS or shared local filesystem)
- Master and secondary master (failover)
- ZooKeeper
- Management: command line, Java API, <REST API/> *
- Server nodes in the grid: master assigns shards, nodes download shards
- Shard replication (pluggable policy)
- Java client API: multicast query, distributed ranking, pluggable selection policy (custom load balancing)
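The multicast query step in the overview can be illustrated with a small scatter-gather sketch: the client sends the same query to every shard in parallel and merges the partial top hits. This is not the Katta client API. The functions `search_shard` and `search_all` and the in-memory toy shards are assumptions, and the scores are taken as given; making scores comparable across shards (the "distributed ranking" part) is not shown.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, term, limit):
    """Return the top `limit` (score, doc_id) hits of one shard, best first.

    A shard is modelled as {doc_id: (text, score)}; a real shard would be
    a Lucene index served by a node.
    """
    hits = [(score, doc_id) for doc_id, (text, score) in shard.items() if term in text]
    return heapq.nlargest(limit, hits)


def search_all(shards, term, limit=3):
    """Send the query to every shard in parallel and merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partial = list(pool.map(lambda s: search_shard(s, term, limit), shards))
    # Each shard returns at most `limit` hits, so the merge stays small.
    merged = heapq.nlargest(limit, (hit for hits in partial for hit in hits))
    return [doc_id for _score, doc_id in merged]


if __name__ == "__main__":
    shards = [
        {"d1": ("ipod sell", 0.9), "d2": ("zune sell", 0.4)},
        {"d3": ("ipod buy", 0.7), "d4": ("ipod sell", 0.95)},
    ]
    print(search_all(shards, "ipod", limit=2))
```

Because every shard only returns its own best hits, the client never has to transfer or sort a full result set.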

12 CLI

13 API

14 Lucene Queries
title:"the Right Way" AND text:go
te?t or test* or te*t
mod_date:[20020101 TO ]
state:ca AND age:[1 TO 15] AND product:ipod
state:ca AND age:[16 TO 21] AND product:ipod

15 Telescope to Microscope
- Create index from XML in MR stage
- Deploy indexes in Katta
- Merge indexes together frequently
- Find documents by key
- Find documents by query

16 XML to Lucene Document
<event id="akey" type="sell">
  <product id="ipod"/>
  <user id="stefan" state="CA" age="31"/>
</event>

/event/@id:akey
/event/@type:sell
/event/product/@id:ipod
/event/user/@id:stefan
/event/user/@state:CA
/event/user/@age:31
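The mapping from XML attributes to index fields shown on this slide can be sketched as a small tree walk that names each field by its XPath-like path. This is an assumption about how such a mapping could be written, not the code used in the project; the function name `flatten` is hypothetical, and the actual step of adding the fields to a Lucene document is left out.

```python
import xml.etree.ElementTree as ET


def flatten(xml_text):
    """Return (field_name, value) pairs, one per XML attribute.

    Field names follow the path notation from the slide,
    e.g. /event/user/@age.
    """
    root = ET.fromstring(xml_text)
    fields = []

    def walk(element, path):
        for name, value in element.attrib.items():
            fields.append((f"{path}/@{name}", value))
        for child in element:
            walk(child, f"{path}/{child.tag}")

    walk(root, f"/{root.tag}")
    return fields


if __name__ == "__main__":
    event = (
        '<event id="akey" type="sell">'
        '<product id="ipod"/>'
        '<user id="stefan" state="CA" age="31"/>'
        "</event>"
    )
    for name, value in flatten(event):
        print(f"{name}:{value}")
```

Naming fields by path lets report developers who know XPath, but not MapReduce, phrase queries against the index directly.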

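17 Range Queries
/event/product/@id:ipod AND /event/user/@state:ca AND /event/user/@age:[001 TO 010]
/event/product/@id:ipod AND /event/user/@state:ca AND /event/user/@age:[011 TO 020]
/event/product/@id:ipod AND /event/user/@state:ca AND /event/user/@age:[021 TO 030]
/event/product/@id:ipod AND /event/user/@state:ca AND /event/user/@age:[031 TO 040]
Counting results -> one network round trip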
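The zero-padded bounds in these range queries suggest that the age values are compared as strings, where "9" would otherwise sort after "10". A minimal sketch of generating one query per age bucket follows. The helper names `pad` and `bucket_queries` are hypothetical, the field names are copied from the slide as-is, and some query parser versions may require escaping characters such as `/` in field names.

```python
def pad(value, width=3):
    """Zero-pad a number so that string order matches numeric order."""
    return str(value).zfill(width)


def bucket_queries(base, field, start, stop, step):
    """Build one inclusive range query per bucket [lo, lo + step - 1]."""
    queries = []
    for lo in range(start, stop + 1, step):
        hi = min(lo + step - 1, stop)
        queries.append(f"{base} AND {field}:[{pad(lo)} TO {pad(hi)}]")
    return queries


if __name__ == "__main__":
    base = "/event/product/@id:ipod AND /event/user/@state:ca"
    for query in bucket_queries(base, "/event/user/@age", 1, 40, 10):
        print(query)
```

Each generated query only needs a hit count, not the documents themselves, which is why one network round trip per bucket is enough to fill a result graph.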

18 Range Queries Result Graph
[Chart of the range query results; y-axis ticks up to 60,000]

19 Pros
- Easy reports can be generated from the Katta index
- Complex reports are generated with many Pig statements (>30 jobs)
- Zoom into data from complex reports
- System scales
- Scaling is cheap
- We keep more data
- Report development is easy

20 Problems
- There was no Cascading, Hive or Jaql; Pig was very young
- Developing against changing open source projects (Hadoop, Pig)
- Pig is/was slow (always text) and (was) buggy
- Katta indexes need to be merged frequently
- Monitoring and management

21 Roadmap
- 0.1 released
- 0.2
- Hadoop 0.18
- Performance improvements
- EC2 support
- Add realtime update support
- Not yet clear how exactly
- Might be similar to Dynamo

22 Thanks
katta.sourceforge.net
sg{at}101tec.com
