Search and Real-Time Analytics on Big Data



Similar documents
Finding the Needle in a Big Data Haystack. Wolfgang Hoschek (@whoschek) JAX 2014

Leveraging the Power of SOLR with SPARK. Johannes Weigend QAware GmbH Germany pache Big Data Europe September 2015

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Using Apache Solr for Ecommerce Search Applications

End to End Solution to Accelerate Data Warehouse Optimization. Franco Flore Alliance Sales Director - APJ

Implement Hadoop jobs to extract business value from large and varied data sets

Integrating Big Data into the Computing Curricula

BIG DATA TRENDS AND TECHNOLOGIES

Making Sense of Big Data in Insurance

Aligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap

XpoLog Competitive Comparison Sheet

Scaling Big Data Mining Infrastructure: The Smart Protection Network Experience

Decoding the Big Data Deluge a Virtual Approach. Dan Luongo, Global Lead, Field Solution Engineering Data Virtualization Business Unit, Cisco

Oracle Database 12c Plug In. Switch On. Get SMART.

Client Overview. Engagement Situation. Key Requirements

SQL + NOSQL + NEWSQL + REALTIME FOR INVESTMENT BANKS

Big Data Are You Ready? Jorge Plascencia Solution Architect Manager

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

Big Data: Beyond the Hype

Real Time Big Data Processing

NoSQL Roadshow Berlin Kai Spichale

So What s the Big Deal?

Apache HBase. Crazy dances on the elephant back

Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies

Introduction to Polyglot Persistence. Antonios Giannopoulos Database Administrator at ObjectRocket by Rackspace

Comprehensive Analytics on the Hortonworks Data Platform

GAIN BETTER INSIGHT FROM BIG DATA USING JBOSS DATA VIRTUALIZATION

Ganzheitliches Datenmanagement

A Performance Analysis of Distributed Indexing using Terrier

The 4 Pillars of Technosoft s Big Data Practice

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Data Integration Checklist

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

Luncheon Webinar Series May 13, 2013

Testing 3Vs (Volume, Variety and Velocity) of Big Data

The Future of Data Management

BIG DATA What it is and how to use?

An Approach to Implement Map Reduce with NoSQL Databases

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON

Big Data on Microsoft Platform

Time series IoT data ingestion into Cassandra using Kaa

Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming. by Dibyendu Bhattacharya

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford

Data Modeling for Big Data

Chapter 7. Using Hadoop Cluster and MapReduce

Datenverwaltung im Wandel - Building an Enterprise Data Hub with

How To Scale Out Of A Nosql Database

Hadoop IST 734 SS CHUNG

CloudSearch: A Custom Search Engine based on Apache Hadoop, Apache Nutch and Apache Solr

Simplifying Big Data Analytics: Unifying Batch and Stream Processing. John Fanelli,! VP Product! In-Memory Compute Summit! June 30, 2015!!

Microsoft Big Data Solutions. Anar Taghiyev P-TSP

Big Data Analytics Nokia

STeP-IN SUMMIT June 2014 at Bangalore, Hyderabad, Pune - INDIA. Performance testing Hadoop based big data analytics solutions

Why NoSQL? Your database options in the new non- relational world IBM Cloudant 1

Choosing The Right Big Data Tools For The Job A Polyglot Approach

You should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required.

BASHO DATA PLATFORM SIMPLIFIES BIG DATA, IOT, AND HYBRID CLOUD APPS

BIG DATA-AS-A-SERVICE

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW

Introduction to Multi-Data Center Operations with Apache Cassandra, Hadoop, and Solr WHITE PAPER

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William

Oracle Big Data SQL Technical Update

Katta & Hadoop. Katta - Distributed Lucene Index in Production. Stefan Groschupf Scale Unlimited, 101tec. sg{at}101tec.com

BIG DATA TOOLS. Top 10 open source technologies for Big Data

Session 1: IT Infrastructure Security Vertica / Hadoop Integration and Analytic Capabilities for Federal Big Data Challenges

Introduction to Big Data Training

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

What do Big Data & HAVEn mean? Robert Lejnert HP Autonomy

Manifest for Big Data Pig, Hive & Jaql

Big Data? Definition # 1: Big Data Definition Forrester Research

Contents. Pentaho Corporation. Version 5.1. Copyright Page. New Features in Pentaho Data Integration 5.1. PDI Version 5.1 Minor Functionality Changes

Hadoop Ecosystem B Y R A H I M A.

Converged, Real-time Analytics Enabling Faster Decision Making and New Business Opportunities

Technical Overview Simple, Scalable, Object Storage Software

Investigating Hadoop for Large Spatiotemporal Processing Tasks

Real-time Data Analytics mit Elasticsearch. Bernhard Pflugfelder inovex GmbH

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Big Data and Data Science: Behind the Buzz Words

NoSQL for SQL Professionals William McKnight

Workshop on Hadoop with Big Data

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

Managing Cloud Server with Big Data for Small, Medium Enterprises: Issues and Challenges

Native Connectivity to Big Data Sources in MSTR 10

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

BIG DATA TECHNOLOGY. Hadoop Ecosystem

Getting Started Practical Input For Your Roadmap

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

CitusDB Architecture for Real-Time Big Data

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Data processing goes big

Trends and Research Opportunities in Spatial Big Data Analytics and Cloud Computing NCSU GeoSpatial Forum

HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Transcription:

Search and Real-Time Analytics on Big Data Sewook Wee, Ryan Tabora, Jason Rutherglen Accenture & Think Big Analytics Strata New York October, 2012

Big Data: data becomes your core asset. It realizes its value when you know how to do what Data Volume, Velocity, Variety Database, Web, Blogs, Social media, Sensor data, smart meter data, log files Architecture Scalable, Agile, Resilient, Affordable Hadoop, NoSQL, NewSQL, MPP Analytics Know your customers better, Know your business better, Know your competitors better Recommender system, Operational intelligence, Competitive intelligence Copyright 2012 Accenture All rights reserved. 2

What does that mean? Let s go with an example: Vehicle Telematics Source: http://www.iso.com/auto-underwriting/improving-claims-underwriting-with-telematics-demo.html Copyright 2012 Accenture All rights reserved. 3

Big Data and Search You often find it useful to have search capability when building service layer down the road. Discover Process Store Analyze Deliver Copyright 2012 Accenture All rights reserved. 4

Search basics Search engine responds to queries NOT from raw documents but from pre-built index. Batch 1. Crawl/import raw documents 2. Extract terms by tokenizing/filtering/stemming 3. Build index structure Search 1. Parse the query 2. Look up index 3. Retrieve documents 4. Rank, cluster, and present Raw data Index Doc ID Content Term Termdocs doc0 doc1 doc2 doc3 the brief for big data will be given to those who for writing massively parallel big data analytics, and Apache Solr for real-time Big Data analytics, conferences on various big data and other programming Copyright 2012 Accenture All rights reserved. 5 Apache big data doc2 doc0, doc1, doc2, doc3 * text body from StrataNY 2012 tutorial abstracts

Search applications on unstructured data Text Search Extract terms from documents Build an index that maps a term to a list of term documents Present the search results (PageRank, clustering) Faceted Search Popular in online retailers Extract metadata (e.g., price, color, brand) from documents Use the metadata to interactively filter out search results Copyright 2012 Accenture All rights reserved. 6

Other search applications Geo location search (near-by points of interest) Geo-aware (context-aware) text search Google image search Copyright 2012 Accenture All rights reserved. 7

Big Data makes search interesting in two ways. Data volume calls for distributed search Index size grows with data Sorted index is expensive to update after certain size Distributed search enables index sharding It comes with distributed system complexity Data velocity calls for real time search Real time contents have a short life cycle They have less terms to index Their context is important but hard to capture It calls for alternatives to traditional batch index building and result ranking techniques Copyright 2012 Accenture All rights reserved. 8

What are available distributed search technologies? Lucene-based Custom engine Copyright 2012 Accenture All rights reserved. 9

Apache Lucene and Solr Solr: Lucene = Car : its engine Kevin Tan (Supermind Consulting) Lucene Open source information retrieval library Created in Java, ported to many other languages Maintains document-text indices and provides search capability Solr Open source enterprise search platform Full-text search server, written in Java Provides distributed search and index replication Source: Grant Ingersoll (LucidWorks) SolrCloud a set of new distributed capabilities in Solr Maintains shard and replication states using Zookeeper Source: http://wiki.apache.org/solr/solrcloud Copyright 2012 Accenture All rights reserved. 10

Commercial offerings based on Lucene/Solr Focus on core search capability - Extensive portfolio of data connectors - Inherit Solr Cloud s distributed system architecture - Convenience layer (e.g., GUI for cluster configuration) Making search capability setup simple - Free search schema - Use JSON over HTTP interface - Use Lucene, but NOT Solr. It built its own distributed system - Automatic shard allocation and migration - facets, highlighting, custom scripts features One integrated system for both DB and search - Use Cassandra in place of Solr Cloud - Store index segment locally keeping LuceneIndexFormat leverage Lucene performance - Store raw field data in Cassandra sync with raw data, durability + availability of Solr * Different from Solandra Copyright 2012 Accenture All rights reserved. 11

Other offerings based on custom search engines Fully-managed search service in the cloud - Leverages A9.com search technology - Users 1) Create a search domain 2) Configure your search fields 3) Upload your data for indexing, and 4) Submit search requests from your web site or application - Auto scaling & fault handling behind the scene A distributed, full-text search engine that is built on Riak Core - Index Riak KV objects as they're written using a precommit hook. - Support various MIME type s (e.g., JSON, XML, text) - facets, highlighting, custom scripts features - Focus on real-time aspect - Term-based partitioning (a.k.a, global indexing) Makes Hadoop, S3 & HBase searchable - Builds indexes using Hadoop MapReduce - Stores indexes on HDFS, S3 or HBase - Query with a thin client runtime - Cloud (AWS) or On-Prem deployment options - Beta release stage Copyright 2012 Accenture All rights reserved. 12

Thank you! Sewook Wee R&D Manager Data and Platforms R&D Accenture Technology Labs (408) 817-2153 Sewook.wee@accenture.com Copyright 2012 Accenture All rights reserved. 13