Real-World Analytics with Solr Cloud and Spark
Solving Analytic Problems for Billions of Records Within Seconds
Johannes Weigend, QAware GmbH
Apache Big Data North America 2016, Vancouver, May 2016
Any Questions? Ask directly or tweet using the hashtag #cloudnativenerd
The Problem We Want to Solve
- Interactive applications with response times below one second!
- Processing of billions of records (10^9 to 10^12 rows / records)
- Continuous data import (near realtime)
- Applications on top of the Reactive Manifesto
Horizontal Scalability Can Be Difficult!
Horizontal scalability of functions
- Trivial: load balancing of (stateless) services (macro-/microservices). More users → more machines
- Not trivial: more machines → faster response times
Horizontal scalability of data
- Trivial: linear distribution of data across multiple machines. More machines → more data
- Not trivial: constant response times with growing datasets
Hadoop Gives Answers for Horizontal Scalability of Data and Functions
The Processing of Distributed Data Can Be Quite Slow!
[Diagram: data flows from HDFS / NFS / NoSQL through Read → Filter → Map → Reduce stages; a full scan with foreach() takes minutes to hours]
With Upfront Indexing and Searching, Less Data Has to Be Read and Filtered.
[Diagram: data flows from a Search / NoSQL store through Search → Filter → Map → Reduce stages; foreach() now takes seconds to minutes]
[Diagram: a frontend talks to a business layer running Spark Map/Reduce; the cluster processing layer pushes searches down to the distributed data]
DEMO
1. Solr Cloud for Analytics
[Diagram: Spark Map/Reduce on top of Search/Filter stages over a Search / NoSQL store]
Solr Cloud
- Document-based NoSQL database with outstanding search capabilities
- A document is a collection of fields (string, number, date, ...)
- Single and multi-valued fields (fields can be arrays)
- Nested documents
- Static and dynamic schema
- Powerful query language (Lucene)
- Horizontally scalable with Solr Cloud
  - Distributed data in separate shards
  - Resilience through the combination of ZooKeeper and replication
- Powerful aggregations (aka facets)
- Stable as of v6.0
The Architecture of Solr Cloud
[Diagram: a ZooKeeper ensemble coordinates a cluster of Solr servers; a collection is split into shards 1-9, each shard has a leader and a replica (replicas 1-9) distributed across the servers; adding servers scales the cloud out]
Solr Stores Everything in a Single Table (BigTable-Style). Searching Is Extremely Fast and Powerful. (*)
Example: a Customer (name, address) has 1..* Orders (amount, product); both entity types are flattened into one document table:

Type     | ID | Name | Address | Amount | Product | C2O
Customer | 1  | K1   | A1      | -      | -       | [3,5]
Customer | 2  | K2   | A2      | -      | -       | [4]
Order    | 3  | -    | -       | Z1     | P1      | [1]
Order    | 4  | -    | -       | Z2     | P2      | [2]
...

(*) With 100 million documents per shard, queries and aggregations normally run in less than 100 ms
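A minimal SolrJ sketch of indexing this single-table model, assuming SolrJ 6.x; the zkHost, the dynamic-field suffixes (_s, _ss) and the c2o link field are illustrative, not from the slides:

import java.util.Arrays;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IndexCustomerOrderExample {
    public static void main(String[] args) throws Exception {
        // zkHost and collection name are illustrative
        try (CloudSolrClient solr = new CloudSolrClient.Builder()
                .withZkHost("localhost:9983").build()) {
            solr.setDefaultCollection("bigdata2016");

            // One flat document per entity; a type field tells them apart
            SolrInputDocument customer = new SolrInputDocument();
            customer.addField("id", "1");
            customer.addField("type_s", "Customer");
            customer.addField("name_s", "K1");
            customer.addField("address_s", "A1");
            customer.addField("c2o_ss", Arrays.asList("3", "5")); // IDs of this customer's orders

            SolrInputDocument order = new SolrInputDocument();
            order.addField("id", "3");
            order.addField("type_s", "Order");
            order.addField("amount_s", "Z1");
            order.addField("product_s", "P1");
            order.addField("c2o_ss", Arrays.asList("1")); // back-link to the customer

            solr.add(Arrays.asList(customer, order));
            solr.commit();
        }
    }
}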
A Solr Cloud Can Be Started in Seconds
Create a schema by reusing an existing set of Solr config files. There are examples in the installation directory $SOLR_HOME/server/solr/configsets which can be copied and modified:
cp -r $SOLR_HOME/server/solr/configsets/basic_configs \
      $SOLR_HOME/server/solr/configsets/bigdata2016
Start Solr. When the wizard asks for a collection name, use bigdata2016 (see above):
$SOLR_HOME/bin/solr start -e cloud
Make a first test:
curl "http://localhost:8983/solr/bigdata2016/query?q=*:*"
With the Solr Cloud Collections API, Collections and Their Shards Can Be Created, Changed or Deleted
Create a collection:
<<SOLR_URL>>/solr/admin/collections?action=CREATE&
  name=<<name of collection>>&
  numShards=16&
  replicationFactor=2&
  maxShardsPerNode=8&
  collection.configName=<<name of uploaded ZooKeeper configuration>>
Delete a collection:
<<SOLR_URL>>/solr/admin/collections?action=DELETE&name=<<name of collection>>
https://cwiki.apache.org/confluence/display/solr/Collections+API
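For example, the bigdata2016 collection from the previous slide could be created like this (host and parameter values are illustrative):

curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=bigdata2016&numShards=4&replicationFactor=2&maxShardsPerNode=8&collection.configName=bigdata2016"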
ZooKeeper Has to Be Started First and the Solr Configuration Must Be Uploaded to Use a Solr Cloud
1. Start ZooKeeper on 2n+1 nodes (odd number):
$ZOO_HOME/bin/zkServer.sh start
2. Upload the Solr configuration into ZooKeeper:
$SOLR_HOME/server/scripts/cloud-scripts/zkcli.sh -cmd upconfig \
  -zkhost 192.168.1.100:2181,192.168.1.101:2181,192.168.1.102:2181 \
  -confname ekgdata -solrhome /opt/solr/server/solr \
  -confdir /opt/solr/server/solr/configsets/ekgdata_configs/conf
3. Start Solr on n nodes connected to the ZooKeeper cluster:
$SOLR_HOME/bin/solr start -c -z 192.168.1.100:2181,192.168.1.101:2181,192.168.1.102:2181
4. Create a collection with a number of shards and replicas
Example: Solr Cloud for Analytics of Insurance Data
IBM Watson insurance sample data containing the following fields: Education, Gender, Income, ...
DEMO
Executing Facet Queries
Term Facets Group and Count a Single Field.
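A hedged example against the insurance collection using the JSON Facet API (collection and field names are illustrative): this counts documents per distinct value of the gender field.

curl http://localhost:8983/solr/bigdata2016/query -d 'q=*:*&json.facet={
  by_gender : { type : terms, field : gender }
}'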
Function Facets Aggregate Fields.
http://yonik.com/solr-facet-functions/
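A hedged example in the style of the linked article (field names are illustrative): aggregation functions such as avg() and max() compute statistics over a numeric field.

curl http://localhost:8983/solr/bigdata2016/query -d 'q=*:*&json.facet={
  avg_income : "avg(income)",
  max_income : "max(income)"
}'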
Pivot Facets Compose Facets to Hierarchies.
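In the JSON Facet API, hierarchies are built by nesting facets inside facets. A hedged sketch (field names are illustrative): counts per education level, broken down by gender within each level.

curl http://localhost:8983/solr/bigdata2016/query -d 'q=*:*&json.facet={
  by_education : {
    type : terms,
    field : education,
    facet : {
      by_gender : { type : terms, field : gender }
    }
  }
}'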
Solr 6 Supports SQL
- Solr 6 supports distributed SQL
- The JDBC driver is part of the solrj client library
- A collection is currently mapped to a single table:
  Collection -> Table, SolrDocument -> Row, Field -> Column
- The Solr 6.0 JDBC driver is very limited, but more functionality is expected in upcoming versions: no database metadata, no prepared statements, no mapping to tables per type field
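A hedged sketch of the JDBC usage (connection-string format as documented for Solr 6; host, collection and field names are illustrative):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SolrJdbcExample {
    public static void main(String[] args) throws Exception {
        // Explicit driver registration; JDBC 4 auto-discovery also works
        Class.forName("org.apache.solr.client.solrj.io.sql.DriverImpl");
        String url = "jdbc:solr://localhost:9983?collection=bigdata2016&aggregationMode=map_reduce";
        try (Connection con = DriverManager.getConnection(url);
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT gender, count(*) FROM bigdata2016 GROUP BY gender")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + ": " + rs.getString(2));
            }
        }
    }
}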
Resilience
- The number of replicas per shard is configurable (replication factor); this number corresponds to the number of nodes that can silently fail
- ZooKeeper is the single point of failure, but it can also be made fail-safe by running multiple instances
- Solr knows all ZooKeeper instances and can silently switch over to the next available instance if the currently connected ZooKeeper crashes
You've Got Everything for Analytics Applications! Or Not?
- Client-side processing of Solr documents does not scale
- There is no way to run parallel business logic inside Solr with a strong separation of concerns between Solr and your code
- The Solr index is not a general-purpose store for huge data: images, videos, binaries / large text documents
- No interface to machine learning or typical statistics libraries (R)
- ...
Distributed In-Memory Computing with Apache Spark
[Diagram: Spark Map/Reduce on top of Search/Filter stages over a Search / NoSQL store]
READ THIS: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
- Distributed computing, up to 100x faster than Hadoop Map/Reduce: distributed Map/Reduce on distributed data can be done in-memory
- Written in Scala (JVM), with Java/Scala/Python APIs
- Processes data from distributed and non-distributed sources: text files (accessible from all nodes), Hadoop File System (HDFS), databases (JDBC), Solr via the Lucidworks API, ...
[Diagram: Spark cluster architecture. A driver application on the driver node creates a SparkContext, which connects via a MasterURL to the master (standalone master, YARN or Mesos). On each slave host a worker JVM starts executor JVMs; each executor holds partitions of a Resilient Distributed Dataset (RDD) and runs tasks on them]
A Very First Spark Application
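The code from this slide is not preserved in the extracted text; a minimal Java sketch of a first Spark application (file name is illustrative):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class FirstSparkApp {
    public static void main(String[] args) {
        // local[*] runs one task slot per core inside this JVM; use a MasterURL for a real cluster
        SparkConf conf = new SparkConf().setAppName("FirstSparkApp").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("logfile.log"); // must be readable from all nodes
            long exceptions = lines.filter(line -> line.contains("Exception")).count();
            System.out.println("Lines containing exceptions: " + exceptions);
        }
    }
}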
Spark Pattern 1: Distributed Task with Params
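A hedged Java sketch of this pattern: parallelize() distributes a parameter list across the cluster, and map() runs one task per element. The expensiveComputation helper is a hypothetical placeholder for the real work.

import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class DistributedTaskWithParams {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Pattern1").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            List<Integer> params = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8);
            // One partition per parameter -> up to params.size() parallel tasks
            List<Integer> results = sc.parallelize(params, params.size())
                                      .map(p -> expensiveComputation(p))
                                      .collect();
            System.out.println(results);
        }
    }

    // Hypothetical placeholder for the real per-parameter computation
    static int expensiveComputation(int p) {
        return p * p;
    }
}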
Spark Pattern 2: Distributed Read from External Sources
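A hedged Java sketch of this pattern, assuming Spark 1.6 (where FlatMapFunction returns an Iterable): parallelize the slice descriptors, then let each task open its own connection and read exactly one slice. The URLs and the readFromShard stub are illustrative; a real implementation is sketched later with cursor marks.

import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class DistributedReadExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Pattern2").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // One descriptor per external slice; URLs are illustrative
            List<String> shardUrls = Arrays.asList(
                    "http://host1:8983/solr/bigdata2016_shard1_replica1",
                    "http://host2:8983/solr/bigdata2016_shard2_replica1");
            // Each task reads one slice in parallel on the executors
            JavaRDD<String> records = sc.parallelize(shardUrls, shardUrls.size())
                                        .flatMap(url -> readFromShard(url));
            System.out.println(records.count());
        }
    }

    // Hypothetical stub; a real version would stream the shard's documents
    static List<String> readFromShard(String shardUrl) {
        return Arrays.asList("doc1@" + shardUrl, "doc2@" + shardUrl);
    }
}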
Spark Pattern 3: Caching and Further Processing with RDDs
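A hedged Java sketch of this pattern: cache() keeps the RDD partitions in executor memory after the first action, so subsequent actions skip re-reading the source. File name is illustrative.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CachingExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Pattern3").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // cache() marks the RDD to be kept in memory after the first action
            JavaRDD<String> logLines = sc.textFile("logfile.log").cache();
            long errors = logLines.filter(l -> l.contains("ERROR")).count();  // reads and caches
            long warnings = logLines.filter(l -> l.contains("WARN")).count(); // served from memory
            System.out.println(errors + " errors, " + warnings + " warnings");
        }
    }
}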
DEMO
Putting It All Together: Solr & Spark in Action
[Diagram: Spark Map/Reduce on top of Search/Filter stages over a Search / NoSQL store]
How to Implement readFromShard()? There are several possibilities:
- SolrJ: SolrCloudStream / the /export handler can stream mass data (with limitations): it supports only JSON (no binary or XML)
  http://localhost:8983/solr/bigdata2016/export?q=*:*&sort=id%20asc&fl=id
- Or: SolrJ cursor marks
- Or: build your own custom export handler
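A hedged sketch of the cursor-mark variant, assuming SolrJ 6.x (shard URL and page size are illustrative): cursor marks give stable deep paging through all documents of one shard.

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorMarkReader {
    // Reads all documents of one shard with SolrJ cursor marks (deep paging)
    static List<SolrDocument> readFromShard(String shardUrl) throws Exception {
        List<SolrDocument> docs = new ArrayList<>();
        try (HttpSolrClient solr = new HttpSolrClient(shardUrl)) {
            SolrQuery query = new SolrQuery("*:*");
            query.setRows(1000);
            query.setSort(SolrQuery.SortClause.asc("id")); // cursors need a unique sort field
            query.set("distrib", "false"); // query this shard only, no fan-out
            String cursor = CursorMarkParams.CURSOR_MARK_START;
            while (true) {
                query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
                QueryResponse rsp = solr.query(query);
                docs.addAll(rsp.getResults());
                String next = rsp.getNextCursorMark();
                if (cursor.equals(next)) break; // same cursor twice = no more results
                cursor = next;
            }
        }
        return docs;
    }
}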
LucidWorks has released a Spark/Solr Integration Library. https://github.com/lucidworks/spark-solr
Lucidworks Solr-Spark Adapter v2.1
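A hedged Java sketch of reading a collection through the adapter's DataFrame source, assuming spark-solr around v2.1 with Spark 1.6 on the classpath (host, collection and field names are illustrative):

import java.util.HashMap;
import java.util.Map;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class SparkSolrAdapterExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkSolr").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            SQLContext sqlContext = new SQLContext(sc);
            Map<String, String> options = new HashMap<>();
            options.put("zkhost", "localhost:9983");
            options.put("collection", "bigdata2016");
            options.put("query", "*:*");
            // The adapter registers itself as the "solr" data source
            DataFrame df = sqlContext.read().format("solr").options(options).load();
            System.out.println(df.filter(df.col("gender").equalTo("F")).count());
        }
    }
}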
Logfile Analytics with Solr and Spark
Histogram of exceptions from hosts A, B, C during time interval D
Step 1: Search with Solr. Solr query:
q=*exception AND (server:A OR server:B OR server:C) AND timestamp:[2015-01-01T00:00:00Z TO 2015-12-31T23:59:59Z]
Step 2: Create a map: key = <<exception name>>, value = count
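A hedged Java sketch of step 2 (the input RDD comes from the distributed Solr search of step 1, e.g. via readFromShard(); the exception_s field name is an assumption):

import java.util.Map;
import org.apache.solr.common.SolrDocument;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

public class ExceptionHistogram {
    // Builds the histogram: key = exception name, value = count
    static Map<String, Integer> histogram(JavaRDD<SolrDocument> docs) {
        JavaPairRDD<String, Integer> counts = docs
                .mapToPair(doc -> new Tuple2<>((String) doc.getFieldValue("exception_s"), 1))
                .reduceByKey((a, b) -> a + b); // runs distributed on the executors
        return counts.collectAsMap(); // small result, safe to collect on the driver
    }
}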
DEMO
Specifications: Intel NUC6i5SYK
- CPU: 6th generation Intel Core i5-6260U with Intel Iris graphics (1.9 GHz up to 2.8 GHz Turbo, dual core, 4 MB cache, 15 W TDP)
- RAM: 32 GB dual-channel DDR4 SODIMMs, 1.2 V, 2133 MHz
- DISK: 256 GB Samsung M.2 internal SSD
In total (four NUCs): 8 cores, 16 HT units, 128 GB RAM, 1 TB disk! This case is as powerful as four notebooks.
Technical Cluster Architecture
[Diagram: four Ubuntu Linux nodes. Each node runs a Spark master JVM and slave (worker) JVM, an executor JVM, and a Solr Cloud instance holding four shards (s1-s4, s5-s8, s9-s12, s13-s16). ZooKeeper instances #1-#3 run on three of the nodes, Zeppelin on two; HDFS spans the cluster]
You Can Build a Solr/Spark Cloud on $70 Odroid Computers
ODROID XU4: 8 cores, 2 GB RAM, 64 GB eMMC disk
~1/10 of the CPU performance compared to an Intel NUC6 / Core i5
[Diagram: a cluster of five Odroid XU4 boards (each 2 GB RAM, 64 GB eMMC disk, Ubuntu Linux, $70) running a Spark master, ZooKeeper, and Spark workers with Solr 5.3 on every node. Totals: 40 cores, 10 GB RAM, 320 GB eMMC disk]
Summary
- Solr Cloud and Spark are a powerful combination for interactive analytics and data-intensive applications
- Writing distributed software remains hard. Only distribute if you have to.
- 100% open source
- A basic integration of Solr and Spark is easy; for high-performance applications things can get challenging
- If professional product support is needed, customers can switch to Lucidworks Fusion to get a pre-integrated and supported Solr/Spark platform
@JohannesWeigend @qaware slideshare.net/qaware blog.qaware.de