Finding the Needle in a Big Data Haystack. Wolfgang Hoschek (@whoschek) JAX 2014


About Wolfgang
Software Engineer @ Cloudera, Search Platform Team
Previously: CERN, Lawrence Berkeley National Laboratory, Skytide
Very large scale analytics platforms, distributed OLAP servers, search servers, BI
Committer on Apache Solr/Lucene, Flume, Lily HBase Indexer, Kite, Morphlines

About Mark Miller
Lucene / Solr committer.
Previously, Core Engineering Manager at LucidWorks.
Currently, Software Engineer at Cloudera.
Co-creator of SolrCloud, Solr's distributed / clustering features.

Agenda
Big Data and Search: setting the stage
Cloudera Search's Architecture
Near Real Time and Batch Use Cases
Conclusion and Q&A

The Enterprise Data Hub
[Diagram: Online NoSQL DBMS, Analytic MPP DBMS, Search Engine, Batch Processing, Stream Processing, and Machine Learning share Resource Management on top of Unified Scale-out Storage For Any Type of Data (elastic, fault-tolerant, self-healing, in-memory capabilities). Also shown: SQL, Streaming, File System (NFS), System Management, Data Management, and Metadata, Security, Audit, Lineage.]
Multiple processing frameworks
One pool of data
One set of system resources
One management interface
One security framework

Search Simplifies Interaction - to Everyone!
Explore, Navigate, Correlate
Experts know MapReduce.
Savvy people know SQL.
Everyone knows Search!

What is Cloudera Search?
Interactive search for Hadoop
  Full-text and faceted navigation
  Batch, near real-time, and on-demand indexing
Apache Solr integrated with CDH
  Established, mature search with vibrant community
  Incorporated as part of the Hadoop ecosystem: Apache Flume, Apache HBase, Apache MapReduce, Kite Morphlines
Open Source
  100% Apache, 100% Solr
  Standard Solr APIs

Cloudera Search Architecture Overview
[Diagram]
Online streaming data -> Flume -> NRT data indexed w/ Morphlines -> SolrCloud cluster(s)
OLTP data -> HBase cluster -> NRT replication events indexed w/ Morphlines -> SolrCloud cluster(s)
Raw, filtered, or annotated data in HDFS -> MapReduce batch indexing w/ Morphlines -> GoLive updates -> SolrCloud cluster(s)
End user client app (e.g. Hue) -> search queries -> SolrCloud cluster(s)
SolrCloud cluster(s) -> indexed data -> HDFS
Cloudera Manager

Challenges
Scalable/reliable index storage
Near Real Time (NRT) indexing
Scalable batch indexing
Usability

Apache Lucene/Solr
Lucene: full text search library
Solr: search service on Lucene
SolrCloud: distributed search
  Partitioned and replicated inverted index
  Low latency, scalable, reliable, HA, secure
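To make "partitioned inverted index" concrete, here is a minimal Python sketch. It is not how Lucene or SolrCloud are implemented: the class names `Shard` and `ShardedIndex`, the modulo routing, and the whitespace tokenizer are all assumptions chosen for brevity, and replication is left out.

```python
from collections import defaultdict


class Shard:
    """One partition of the inverted index: term -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, term):
        return set(self.postings.get(term.lower(), set()))


class ShardedIndex:
    """Routes each document to one shard; queries fan out to all shards."""

    def __init__(self, num_shards):
        self.shards = [Shard() for _ in range(num_shards)]

    def add(self, doc_id, text):
        # Hypothetical routing rule: integer ids keep it deterministic.
        self.shards[doc_id % len(self.shards)].add(doc_id, text)

    def search(self, term):
        hits = set()
        for shard in self.shards:
            hits |= shard.search(term)
        return sorted(hits)


index = ShardedIndex(num_shards=2)
index.add(1, "Solr search service on Lucene")
index.add(2, "Lucene full text search library")
index.add(3, "SolrCloud distributed search")
print(index.search("lucene"))  # [1, 2]
print(index.search("search"))  # [1, 2, 3]
```

Each shard only holds the postings for its own documents, so indexing work spreads across shards while a query has to visit every shard and merge the hits.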

Scalable and Robust Index Storage
[Diagram: Querying API and Indexing API on top of SolrCloud (ZooKeeper, Solr, Extraction, Mapping, Lucene), with HDFS underneath]
Solr and HDFS
Scalable, cost-efficient index storage
Higher availability
Search and process data in one platform

Integrate Solr/Lucene with HDFS
Lucene Directory abstraction
  Implemented HDFSDirectory using the HDFS client library
  Read/write index files directly to HDFS
Solr DirectoryFactory abstraction
  HDFSDirectoryFactory plugs HDFSDirectory into Solr configuration
Solr HdfsTransactionLog
  Puts the transaction log into HDFS so that local file storage is not a concern, and to support replica failover on HDFS.
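The directory/factory pattern described above can be sketched in a few lines of Python. This is an illustration of the idea only, not the Lucene or Solr API: `Directory`, `LocalDirectory`, `SharedDirectory`, and `make_directory` are hypothetical names, and a plain dict stands in for HDFS.

```python
from abc import ABC, abstractmethod


class Directory(ABC):
    """Storage abstraction: the index only sees named files."""

    @abstractmethod
    def write_file(self, name, data):
        """Store bytes under a file name."""

    @abstractmethod
    def read_file(self, name):
        """Return the bytes stored under a file name."""


class LocalDirectory(Directory):
    """Keeps index files in a per-node dict (stand-in for local disk)."""

    def __init__(self):
        self._files = {}

    def write_file(self, name, data):
        self._files[name] = data

    def read_file(self, name):
        return self._files[name]


class SharedDirectory(Directory):
    """Keeps index files in a store shared by all nodes (stand-in for HDFS)."""

    def __init__(self, shared_store, base_path):
        self._store = shared_store
        self._base = base_path.rstrip("/")

    def write_file(self, name, data):
        self._store[f"{self._base}/{name}"] = data

    def read_file(self, name):
        return self._store[f"{self._base}/{name}"]


def make_directory(config, shared_store):
    """Factory: configuration alone decides which backend is used."""
    if config["directory"] == "shared":
        return SharedDirectory(shared_store, config["path"])
    return LocalDirectory()


cluster_store = {}
directory = make_directory({"directory": "shared", "path": "/solr/core1"}, cluster_store)
directory.write_file("segments_1", b"index bytes")
print(sorted(cluster_store))  # ['/solr/core1/segments_1']
```

Because the indexing code talks only to `Directory`, swapping local storage for shared storage is a configuration change rather than a code change.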

Auto Replica Failover
When indexes are in HDFS, if a node goes down, you can start serving the index for the downed machine from a machine that is still up.
Currently under development.
Adds a level of failover and availability that is not available when storing indexes on the local filesystem.
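Why shared storage makes this kind of failover possible can be shown with a small Python sketch. The slide describes the feature as under development, so this is not Solr's algorithm: `reassign_shards`, the node names, and the least-loaded placement rule are assumptions for illustration.

```python
def reassign_shards(assignment, live_nodes):
    """Move shards owned by dead nodes to the least-loaded live nodes.

    Only ownership changes: the index files are assumed to sit in shared
    storage, so no index data has to be copied between machines.
    """
    if not live_nodes:
        raise ValueError("no live nodes left")
    load = {node: 0 for node in live_nodes}
    new_assignment = {}
    orphans = []
    for shard, node in sorted(assignment.items()):
        if node in load:
            new_assignment[shard] = node
            load[node] += 1
        else:
            orphans.append(shard)
    for shard in orphans:
        # Ties are broken by node name so the result is deterministic.
        target = min(sorted(load), key=load.get)
        new_assignment[shard] = target
        load[target] += 1
    return new_assignment


assignment = {"shard1": "node-a", "shard2": "node-b", "shard3": "node-c"}
after = reassign_shards(assignment, live_nodes=["node-a", "node-c"])
print(after["shard2"])  # node-a
```

With indexes on local disks the same reassignment would first require rebuilding or copying the lost shard, which is the gap the slide refers to.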

Near Real Time Indexing with Apache Flume
[Diagram: Log File -> Flume Agent -> Indexer w/ Morphlines -> SolrCloud, shown for two log files; HDFS]
Solr and Flume
Reliable streaming data ingestion at scale
Indexing at data ingest
Flexible ETL via Morphlines
Packaged as Flume Morphline Solr Sink
flume.conf:
    agent.sinks.solrsink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
    agent.sinks.solrsink.morphlineFile = /etc/flume-ng/conf/morphline.conf

Searchable Real-Time Data Indexing HBase
[Diagram: interactive OLTP updates -> HBase (on HDFS) -> triggers on updates -> Lily HBase Indexer(s) w/ Morphlines -> Solr servers (SolrCloud)]
Updates Solr index immediately on HBase cell update
Non-intrusive, flexible, scalable, reliable
Partitioned and replicated indexing w/ failover, HA
hbase-indexer.conf:
    <indexer table="myhbasetable" mapper="com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrMapper">
      <param name="morphlineFile" value="/path/to/extracthbasecell.conf"/>
    </indexer>

Lily HBase Indexer
Collaboration between NGData & Cloudera
  NGData are creators of the Lily data management platform
Lily HBase Indexer
  Service which acts as an HBase replication listener
  Replication updates trigger indexing of updates (rows/cells)
  Integrates Kite Morphlines library for flexible ETL of rows/cells
  ASL licensed on GitHub: http://github.com/ngdata/hbaseindexer
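The replication-listener idea can be sketched as an event consumer that keeps a searchable copy in step with the source table. This Python sketch does not use the HBase or Lily APIs: `ReplicationIndexer`, the event dictionaries, and the column names are hypothetical.

```python
class ReplicationIndexer:
    """Consumes replication events and keeps a searchable copy up to date."""

    def __init__(self):
        self.documents = {}  # row key -> document (dict of field -> value)

    def on_event(self, event):
        row = event["row"]
        if event["type"] == "put":
            # A cell update touches one field of the document for that row.
            doc = self.documents.setdefault(row, {"id": row})
            doc[event["column"]] = event["value"]
        elif event["type"] == "delete":
            self.documents.pop(row, None)

    def search(self, field, value):
        return sorted(r for r, d in self.documents.items() if d.get(field) == value)


indexer = ReplicationIndexer()
events = [
    {"type": "put", "row": "user1", "column": "info:city", "value": "Berlin"},
    {"type": "put", "row": "user2", "column": "info:city", "value": "Berlin"},
    {"type": "put", "row": "user1", "column": "info:name", "value": "Ada"},
    {"type": "delete", "row": "user2"},
]
for event in events:
    indexer.on_event(event)
print(indexer.search("info:city", "Berlin"))  # ['user1']
```

The writer never calls the indexer directly; it only produces the event stream, which is what makes the approach non-intrusive for the application doing the updates.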

Scalable Batch ETL & Indexing
[Diagram: HDFS files or HBase tables -> Indexer w/ Morphlines -> index shard -> GoLive -> Solr server, shown for two shards]
Solr and MapReduce
Flexible, scalable, reliable batch indexing
On-demand indexing, cost-efficient re-indexing
Start serving new indices without downtime
MapReduceIndexerTool
HBaseMapReduceIndexerTool
    hadoop ... MapReduceIndexerTool --morphline-file morphline.conf ...

Scalable Batch Indexing
[Diagram]
Morphline Mappers (three shown): parse HDFS input into indexable documents
Arbitrary reduce steps of indexing & merging using Solr/Lucene (can use all reducer slots for scalability even if #reducers >> #solrshards)
SolrCloud on HDFS: Solr Shard 1 ... Solr Shard N
GoLive: Solr Shard 1 -> Solr Server 1 ... Solr Shard N -> Solr Server N

MapReduce Indexer
MapReduce job with two parts
  1) Scan HDFS (or HBase) for files to be indexed
  2) Mapper/Reducer indexing step
     Mapper extracts content via Kite Morphlines
     Reducer uses Lucene to index documents directly to HDFS
GoLive
  Cloudera created this to bridge the gap between NRT (low latency, expensive) and batch (high latency, cheap at scale) indexing
  Results of the MR indexing operation are immediately merged into a live SolrCloud serving cluster
  No downtime for users
  No NRT expense
  Linear scale out to the size of your MR cluster
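The map, reduce, and merge steps above can be sketched in plain Python. This is a single-process illustration, not MapReduceIndexerTool: the function names, the tab-separated input format, and the modulo partitioning are assumptions, and each shard's "index" is just a dict of term to document ids.

```python
from collections import defaultdict


def map_record(line):
    """Mapper: parse one raw input line ("id<TAB>text") into a document."""
    doc_id, text = line.split("\t", 1)
    return {"id": int(doc_id), "text": text}


def build_shard_index(docs):
    """Reducer: build an inverted index (term -> doc ids) for one shard."""
    index = defaultdict(set)
    for doc in docs:
        for term in doc["text"].lower().split():
            index[term].add(doc["id"])
    return index


def go_live(live_index, new_index):
    """Merge a freshly built shard index into the index being served."""
    for term, ids in new_index.items():
        live_index.setdefault(term, set()).update(ids)


def run_batch(lines, live_shards):
    num_shards = len(live_shards)
    buckets = defaultdict(list)
    for line in lines:  # map + shuffle by target shard
        doc = map_record(line)
        buckets[doc["id"] % num_shards].append(doc)
    for shard_no, docs in buckets.items():  # reduce + merge
        go_live(live_shards[shard_no], build_shard_index(docs))


live_shards = [{"hadoop": {100}}, {}]
run_batch(["1\tsolr on hadoop", "2\thadoop batch indexing"], live_shards)
print(sorted(live_shards[0]["hadoop"]))  # [2, 100]
print(sorted(live_shards[1]["hadoop"]))  # [1]
```

The live shards keep answering queries from their existing postings while the batch index is built elsewhere; only the final merge touches them.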

Streaming ETL (Extract, Transform, Load)
[Diagram: syslog -> Flume Agent -> Solr sink running a morphline (Command: readLine -> Command: grok -> Command: loadSolr) -> Solr. The data passes along as Event -> Record -> Record -> Record -> Document.]
Kite Morphlines
Consume any kind of data from any kind of data source, process and load into Solr, HDFS, HBase or anything else
Simple and flexible data transformation
Extensible set of transformation commands
Reusable across multiple workloads
For batch & near real time
Configuration over coding reduces time & skills
ASL licensed on GitHub: https://github.com/kite-sdk/kite
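The chain-of-commands model can be sketched in Python as objects that each transform a record and hand it to a child. This is not the Kite Morphlines API (which is Java): `Command`, `SplitLines`, `ExtractFields`, and `Collect` are hypothetical stand-ins for readLine, grok, and loadSolr, and the sample pattern is invented.

```python
import re


class Command:
    """Base class: a command processes a record, then passes it to its child."""

    def __init__(self, child=None):
        self.child = child

    def emit(self, record):
        return self.child.process(record) if self.child else True


class SplitLines(Command):
    """Stand-in for readLine: one output record per input line."""

    def process(self, record):
        ok = True
        for line in record["body"].splitlines():
            ok = self.emit({"message": line}) and ok
        return ok


class ExtractFields(Command):
    """Stand-in for grok: named regex groups become record fields."""

    def __init__(self, pattern, child=None):
        super().__init__(child)
        self.pattern = re.compile(pattern)

    def process(self, record):
        match = self.pattern.match(record["message"])
        if match is None:
            return False  # the record does not fit; report failure upstream
        return self.emit({**record, **match.groupdict()})


class Collect(Command):
    """Stand-in for loadSolr: gathers finished documents in a list."""

    def __init__(self):
        super().__init__()
        self.documents = []

    def process(self, record):
        self.documents.append(record)
        return True


sink = Collect()
pipeline = SplitLines(ExtractFields(r"(?P<level>\w+) (?P<text>.*)", sink))
pipeline.process({"body": "INFO started\nWARN disk almost full"})
print([doc["level"] for doc in sink.documents])  # ['INFO', 'WARN']
```

Passing a record to the next command is a direct method call on the child object, so the whole chain runs in the calling thread without queues in between.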

Kite Morphlines Architecture
Morphlines can be embedded in any application
[Diagram: logs, tweets, social media, HTML, images, PDF, text... anything you want to index -> Flume, MR Indexer, HBase Indexer, etc... or your application! (each embedding the Morphline Library) -> SolrCloud (Solr servers) or your app]

Morphline Example: syslog with grok
    morphlines : [
      {
        id : morphline1
        importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
        commands : [
          { readLine {} }
          {
            grok {
              dictionaryFiles : [/tmp/grok-dictionaries]
              expressions : {
                message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
              }
            }
          }
          { loadSolr {} }
        ]
      }
    ]
Example input:
    <164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22
Output record:
    syslog_pri:164
    syslog_timestamp:Feb 4 10:46:14
    syslog_hostname:syslog
    syslog_program:sshd
    syslog_pid:607
    syslog_message:listening on 0.0.0.0 port 22
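For readers without a grok dictionary at hand, the same extraction can be approximated with one plain regular expression. This Python sketch is an assumption about what the grok macros expand to, not their actual dictionary definitions; `SYSLOG_LINE` and `parse_syslog` are hypothetical names.

```python
import re

# Hand-written approximations of POSINT, SYSLOGTIMESTAMP, SYSLOGHOST,
# DATA and GREEDYDATA; the real grok dictionary entries are more thorough.
SYSLOG_LINE = re.compile(
    r"<(?P<syslog_pri>\d+)>"
    r"(?P<syslog_timestamp>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<syslog_hostname>\S+) "
    r"(?P<syslog_program>.*?)(?:\[(?P<syslog_pid>\d+)\])?: "
    r"(?P<syslog_message>.*)"
)


def parse_syslog(line):
    """Return the extracted fields as a dict, or None if the line does not match."""
    match = SYSLOG_LINE.fullmatch(line)
    return match.groupdict() if match else None


record = parse_syslog("<164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22")
for name, value in record.items():
    print(f"{name}:{value}")
```

All values come back as strings, and a line that does not fit the pattern yields `None` instead of a partial record.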

Example: Indexing log4j w/ stacktraces
Record 1:
    juil. 25, 2012 10:49:40 AM hudson.triggers.SafeTimerTask run ok
Record 2:
    juil. 25, 2012 10:49:46 AM hudson.triggers.SafeTimerTask run failed
    com.amazonaws.AmazonClientException: Unable to calculate a request signature
      at com.amazonaws.auth.AbstractAWSSigner.signAndBase64Encode(AbstractAWSSigner.java:71)
      at java.util.TimerThread.run(Timer.java:505)
    Caused by: com.amazonaws.AmazonClientException: Unable to calculate a request signature
      at com.amazonaws.auth.AbstractAWSSigner.sign(AbstractAWSSigner.java:90)
      at com.amazonaws.auth.AbstractAWSSigner.signAndBase64Encode(AbstractAWSSigner.java:68)
      ... 14 more
    Caused by: java.lang.IllegalArgumentException: Empty key
      at javax.crypto.spec.SecretKeySpec.<init>(SecretKeySpec.java:96)
      at com.amazonaws.auth.AbstractAWSSigner.sign(AbstractAWSSigner.java:87)
      ... 15 more
Record 3:
    juil. 25, 2012 10:49:54 AM hudson.slaves.SlaveComputer tryReconnect

Example: Indexing log4j w/ stacktraces
    morphlines : [
      {
        id : morphline1
        importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
        commands : [
          {
            readMultiLine {
              regex : "(^.+Exception:.+)|(^\\s+at.+)|(^\\s+... \\d+ more)|(^\\s*Caused by:.+)"
              what : previous
              charset : UTF-8
            }
          }
          { logDebug { format : "output record: {}", args : ["@{}"] } }
          { loadSolr {} }
        ]
      }
    ]
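The effect of `what : previous` is that any line matching the regex is glued onto the record before it. The Python sketch below imitates that behaviour; it is not the Kite implementation. `CONTINUATION` and `group_multiline` are hypothetical names, the pattern assumes the four line shapes are alternatives, and the sample log lines are invented.

```python
import re

# A line matching any alternative continues the previous record.
CONTINUATION = re.compile(
    r"(^.+Exception: .+)|(^\s+at .+)|(^\s+\.\.\. \d+ more)|(^\s*Caused by:.+)"
)


def group_multiline(lines):
    """Group raw lines into records, attaching continuation lines to the previous one."""
    records = []
    for line in lines:
        if records and CONTINUATION.match(line):
            records[-1].append(line)
        else:
            records.append([line])
    return ["\n".join(record) for record in records]


log_lines = [
    "INFO run ok",
    "ERROR run failed",
    "com.example.ClientException: Unable to sign",
    "    at com.example.Signer.sign(Signer.java:71)",
    "Caused by: java.lang.IllegalArgumentException: Empty key",
    "    ... 15 more",
    "INFO reconnect",
]
records = group_multiline(log_lines)
print(len(records))  # 3
```

The whole stack trace ends up in the second record, so a search hit on the exception message also returns the log line that triggered it.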

Example: Escape to Java Code
    morphlines : [
      {
        id : morphline1
        importCommands : ["org.kitesdk.**"]
        commands : [
          {
            java {
              code: """
                List tags = record.get("tags");
                if (!tags.contains("hello")) {
                  return false;
                }
                tags.add("world");
                return child.process(record);
              """
            }
          }
        ]
      }
    ]

Current Morphline Command Library
Supported data formats
  Text: single-line record, multi-line records, CSV, CLOB
  Apache Avro, Parquet files
  Apache Hadoop Sequence Files
  Apache Hadoop RCFiles
  JSON
  XML, XPath, XQuery
  Via Apache Tika: HTML, PDF, MS-Office, images, audio, video, email
  HBase rows/cells
  Via pluggable commands: your custom data formats
Regex based pattern matching and extraction
Flexible log file analysis
Integrate with and load data into Apache Solr
Auto-detection of MIME types from binary data using Apache Tika

Current Morphline Command Library (cont'd)
Scripting support for dynamic Java code
Geolocation info for an IP address (Maxmind)
Operations on fields for assignment and comparison
Operations on fields with list and set semantics
if-then-else conditionals
A small rules engine (tryRules)
String and timestamp conversions
slf4j logging
Yammer metrics and counters
Decompression and unpacking of arbitrarily nested container file formats
etc, etc

Morphline Performance and Scaling
The runtime compiles the morphline on the fly
The runtime processes all commands of a given morphline in the same thread
Piping a record from one command to another is fast: it is just a cheap Java method call, with no queues, no handoffs among threads, no context switches, and no serialization between commands
For scalability, deploy many morphline instances on a cluster in many Flume agents and MapReduce tasks

Simple, Customizable Search Interface
Hue
Simple UI
Navigated, faceted drill down
Customizable display
Full text search, standard Solr API and query language

Security
Cloudera Search + Sentry
Cluster level access control
Index level access control
Document level access control
Field level access control

Simplified Management
Cloudera Manager
Install, configure, deploy Solr services on the cluster
Unified management and monitoring
Resource management

Cloudera Manager Flume Morphline GUI

Conclusion and Getting Started
Cloudera Search for CDH4 & CDH5
Free download
Extensive documentation
Send your questions and feedback to searchuser@cloudera.org
Take the Search online training
Get started with a Live Instance @ Cloudera Live
  No download, no installation, no waiting!
  QuickStart VM also available

© 2014 Cloudera, Inc. All rights reserved.