Search-based business intelligence and reverse data engineering with Apache Solr


Mario-Leander Reimer, Chief Technologist

This talk will
o Give a brief overview of the AIR system's architecture
o Show reverse data engineering using Solr and MIR
o Talk about the fight for our right to Solr
o Describe solutions for the problem of combinatorial explosion
o Outline a flexible and lightweight ETL approach for Solr

[Figure: AIR system architecture. Components shown: AIR Central (Spring Framework, JEE 5) with JSF web UI and REST API, offering search and display, protocol, watchlist, file storage, AIR DB and document storage to browser clients (independent workshop, mechanic), WS clients, the AIR Bus and backend systems; the AIR Repository application cluster (Apache Solr with Solr extensions and Solr access, 20 languages, 800 GB Solr index) holding master data and documents such as vehicles, service bulletins, retrofits, measures, defects, maintenance, launch, parts and flat rates; the AIR Client (.NET WPF) used by service technicians; third-party applications using the AIR Fork DLL and AIR Call DLL; third-party iOS apps using the AIR iOS Lib; the AIR Loader application cluster (Spring Framework), which loads vehicles, service bulletins, maintenance, defects, flat rates, documents, repair overviews and more from the backend databases and systems; and AIR Control (Jenkins), which executes the loader.]

[Figure: AIR Repository and loading. AIR Control (Jenkins) executes the AIR Loader (Spring Framework), which loads vehicles, service bulletins, maintenance, defects, flat rates, repair overviews, documents and more from the backend databases and systems into the Apache Solr master. The master replicates to the Apache Solr slave in the AIR Repository application cluster. Each Solr instance carries the Solr extensions and an 800 GB index in 20 languages. MIR is attached through a search connection.]

Let's go back to when it all began.
Source: http://www.october212015.com/images/timecircuits.jpg

The project vision: find the right information in less than 3 clicks.
The situation:
o Users had to use up to 7 different applications for their daily work.
o The systems were not well integrated.
o Finding the correct information was laborious and error-prone.
The idea:
o Combine the data into a consistent information network.
o Make the information network and its data searchable and navigable.
o Replace the existing applications with one easy-to-use application.

But how do we find the originating system for the desired data?
Where do we find the vehicle data? There were 60 potential systems and 5,000 entities.
[Figure: systems A to D, with vehicle data and other data scattered across them.]

And how do we find the hidden relations between the systems and their data?
How is the data linked? There were 400,000 potential relations.
[Figure: systems A to D holding customer, vehicle, document and other data.]

Meta Information Research (MIR)
Source: http://www.thewallpapers.org/photo/31865/mir_space_station_12_june_1998.jpg

MIR is a simple and lightweight data reverse engineering and analysis tool based on Solr.
o MIR manages meta information about the source systems (the data models and record descriptions).
o MIR lets you navigate and search the metadata; you can drill into the metadata using facets.
o MIR also manages the target data model and the Solr schema description.
[Figure: Meta Information Research system. Components shown: MIR user interface, a 25 MB metadata index in Apache Solr, the MIR Loader reading from the backend databases and systems, and the MIR Generators for sources (Java, XML) and MagicDraw.]

[Screenshot: MIR user interface with faceted drill-down, wildcard queries, search results, and a tree view of systems, tables and attributes. The example search found potential synonyms for the chassis number.]

EAT YOUR OWN DOG FOOD. The AIR Solr schema definition is modelled and defined within MIR.
[Screenshot: Solr entities for each release and Solr schema attributes in MIR.]

def sourcegenerator = MIR + Solr + Maven;

But Solr is a full-text search engine. You have to use an Oracle DB for your application data!
NO!

Some of the AIR requirements were...
o Focus is on search. Transactions are not required.
o High demands on request volume and performance.
o Free navigation on data model and content.
o Support for full-text search and faceted search.
o Offline capabilities.
o Scalability from low-end device to server to cloud.

Apache Solr outperformed Oracle significantly in query time as well as index size.

Oracle: SELECT * FROM VEHICLE WHERE VIN='V%'          383 ms   384 ms   383 ms
Solr:   INFO_TYPE:VEHICLE AND VIN:V*                   38 ms     0 ms     0 ms

Oracle: SELECT * FROM VEHICLE WHERE VIN='%X%'         389 ms   387 ms   386 ms
Solr:   INFO_TYPE:VEHICLE AND VIN:*X*                  92 ms     0 ms     0 ms

Oracle: SELECT * FROM MEASURE WHERE TEXT='engine'     859 ms   379 ms   383 ms
Solr:   INFO_TYPE:MEASURE AND TEXT:engine              39 ms     0 ms     0 ms

Test data set: 150,000 records
Disk space: 132 MB for Solr vs. 385 MB for Oracle
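The Solr side of this comparison can be issued as plain HTTP requests. The following is a minimal sketch that only builds the request URLs; the host, port and core name `air` are assumptions that do not appear in the talk, and only the three query strings are taken from the slide.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical Solr endpoint; host, port and core name are assumptions.
SOLR_SELECT = "http://localhost:8983/solr/air/select"

# The three Solr queries from the benchmark slide.
QUERIES = {
    "prefix": "INFO_TYPE:VEHICLE AND VIN:V*",
    "infix": "INFO_TYPE:VEHICLE AND VIN:*X*",
    "term": "INFO_TYPE:MEASURE AND TEXT:engine",
}


def select_url(query: str, rows: int = 10) -> str:
    """Return a Solr select URL for the given query string."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{SOLR_SELECT}?{urlencode(params)}"


if __name__ == "__main__":
    for name, query in QUERIES.items():
        print(name, select_url(query))
```

Timing these requests against a real index would additionally need an HTTP client and a loaded core, which this sketch leaves out.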

Dirt Race Use Case:
o Low-end devices
o No Internet
Source: http://www.dirtbikerider.com/news/images/anotherimpressivegpweekendforhusqvarna_553db21addaaa.jpg
28 September 2015

Running Solr and AIR-2-Go on a Raspberry Pi Model B worked like a charm.
o Running Debian Linux + JDK 8
o Jetty container with the Solr and AIR WARs deployed
o Reduced Solr data set with approx. 1.5 million documents
Model B hardware specs:
o ARMv6 CPU at 700 MHz
o 512 MB RAM
o 32 GB SD card

YOU GOTTA FIGHT FOR YOUR RIGHT TO SOLR!

No silver bullet. A careful schema design is crucial for your Solr performance.

Naive data denormalization can quickly lead to combinatorial explosion.
[Figure: relationship navigation between entity types, linked by n:m relations.]

Entity                  Count
FRU Groups              5,078,411
Flat Rate Units         14,830,197
Packages                1,678,667
Measures                648,129
Repair Instructions     18,573
Vehicles                33,071,137
Types                   41,385
Technical Documents     648,129
Parts                   55,000
Fault Indications       6,180

Num Docs: 55,777,706
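To see why naive denormalization explodes, consider flattening a single entity together with all combinations of its n:m relations. The following is a toy sketch; the ids and fan-outs are hypothetical and not taken from the AIR data model.

```python
from itertools import product

# Hypothetical toy entity: one measure related to several types,
# parts and documents (ids invented for illustration).
measure = {
    "id": "M1",
    "types": ["T1", "T2", "T3"],
    "parts": ["P1", "P2"],
    "documents": ["D1", "D2", "D3", "D4"],
}

RELATIONS = ["types", "parts", "documents"]


def flatten(entity):
    """Naive denormalization: one flat document per combination of related ids."""
    combos = product(*(entity[name] for name in RELATIONS))
    return [{"id": entity["id"], **dict(zip(RELATIONS, combo))} for combo in combos]


flat_docs = flatten(measure)
print(len(flat_docs))  # 3 * 2 * 4 combinations for a single measure
```

The document count grows with the product of the fan-outs, so every additional relation multiplies the index size rather than adding to it.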

Multi-valued fields can efficiently store 1..n relations but may result in false positives.

  {
    "INFO_TYPE": "AWPOS_GROUP",
    "NUMMER":   ["1134190",              "1235590"],
    "BAUSTAND": ["1969-12-31T23:00:00Z", "1975-12-31T23:00:00Z"],
    "E_SERIES": ["F10",                  "E30"]
  }
                 Index 0                 Index 1

  q=INFO_TYPE:AWPOS_GROUP AND NUMMER:1134190 AND E_SERIES:F10
  q=INFO_TYPE:AWPOS_GROUP AND NUMMER:1134190 AND E_SERIES:E30

Both queries match the document, although NUMMER 1134190 (index 0) and E_SERIES E30 (index 1) do not belong together. If this matters, perform post-filtering in your application.
Note: recent Solr versions support nested child documents; use those instead.
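The false positive and the application-side post filter can be sketched in a few lines. This is an illustration only: the two functions are hypothetical helpers that mimic the matching behaviour, not Solr API calls.

```python
# The document from the slide: values at the same list position belong together.
doc = {
    "INFO_TYPE": "AWPOS_GROUP",
    "NUMMER": ["1134190", "1235590"],
    "BAUSTAND": ["1969-12-31T23:00:00Z", "1975-12-31T23:00:00Z"],
    "E_SERIES": ["F10", "E30"],
}


def multi_valued_match(doc, nummer, e_series):
    """Mimics the query: each field is tested on its own, positions are ignored."""
    return nummer in doc["NUMMER"] and e_series in doc["E_SERIES"]


def post_filter(doc, nummer, e_series):
    """Application-side check: both values must sit at the same index."""
    pairs = zip(doc["NUMMER"], doc["E_SERIES"])
    return any(n == nummer and e == e_series for n, e in pairs)


if __name__ == "__main__":
    for e_series in ("F10", "E30"):
        hit = multi_valued_match(doc, "1134190", e_series)
        kept = post_filter(doc, "1134190", e_series)
        print(e_series, "query hit:", hit, "kept after post filter:", kept)
```

The second query (E30) is a hit for the index but is dropped by the post filter, because 1134190 and E30 sit at different positions.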

Technical documents and their validity were expressed in a binary representation.
o Validity expressions may have up to 46 characteristics.
o Validity expressions use 5 different boolean operators (AND, NOT, ...).
o Validity expressions can be nested and complex.
o Some characteristics are dynamic and not even known at index time.
Solution: transform the validity expressions into the equivalent JavaScript terms and evaluate these terms at query time using a custom function query filter.

Binary validity expression example.
  Type(53078923) = Brand, Value(53086475) = BMW PKW
  Type(53088651) = E-Series, Value(53161483) = F10
  Type(64555275) = Transmission, Value(53161483) = MECH

Transformation of binary validity terms into their JavaScript equivalent at index time.

  AND(Brand='BMW PKW', E-Series='F10', Transmission='MECH')

  ((BRAND=='BMW PKW')&&(E_SERIES=='F10')&&(TRANSMISSION=='MECH'))

  {
    "INFO_TYPE": "TECHNISCHES_DOKUMENT",
    "DOKUMENT_TITEL": "Getriebe aus- und einbauen",
    "DOKUMENT_ART": "reparaturanleitung",
    "VALIDITY": "((BRAND=='BMW PKW')&&((E_SERIES=='F10')&&(...))",
    "BRAND": ["BMW PKW"],
    ...
  }
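The index-time transformation can be sketched as a small recursive translator. The binary input format is not described in the talk, so this sketch assumes the expression has already been decoded into nested tuples; the tuple layout, the `EQ` and `OR` operators, and the helper names are assumptions made for illustration.

```python
import re


def field_name(characteristic: str) -> str:
    """'E-Series' -> 'E_SERIES': upper case, other characters become underscores."""
    return re.sub(r"[^A-Za-z0-9]+", "_", characteristic).upper()


def to_js(expr) -> str:
    """Translate a nested (operator, operands...) tuple into a JavaScript term."""
    op = expr[0]
    if op == "EQ":
        _, characteristic, value = expr
        # Values are assumed not to contain single quotes.
        return f"({field_name(characteristic)}=='{value}')"
    if op == "NOT":
        return f"(!{to_js(expr[1])})"
    if op in ("AND", "OR"):
        joiner = "&&" if op == "AND" else "||"
        return "(" + joiner.join(to_js(e) for e in expr[1:]) + ")"
    raise ValueError(f"unsupported operator: {op}")


validity = (
    "AND",
    ("EQ", "Brand", "BMW PKW"),
    ("EQ", "E-Series", "F10"),
    ("EQ", "Transmission", "MECH"),
)

print(to_js(validity))
```

For the example expression this produces the same term as shown on the slide; nested expressions are handled by the recursion.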

The JavaScript validity term is evaluated at query time using a custom function query.

  &fq=INFO_TYPE:TECHNISCHES_DOKUMENT
  &fq=DOKUMENT_ART:reparaturanleitung
  &fq={!frange l=1 u=1 incl=true incu=true cache=false cost=500}
      jsterm(VALIDITY,eyjnt1rpul9lukfgvfnut0zgqvjux01pve9sqvjcruluu
      1ZFUkZBSFJFTiI6IkIiLCJFX01BU0NISU5FX0tSQUZUU1RPRkZBUlQiOm51bG
      wsilnjq0hfukhfsvrtrkfiulpfvucioiiwiiwiqu5uuklfqii6ikfxrcisikv
      kjbvvjfsuhfijoiwccifq==)

Base64 decode of the second jsterm argument:

  { "BRAND":"BMW PKW", "E_SERIES":"F10", "TRANSMISSION":"MECH" }

http://qaware.blogspot.de/2014/11/how-to-write-postfilter-for-solr-49.html
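On the client side, the vehicle context has to be serialized and Base64-encoded before it is passed to the function query. The following is a minimal sketch of that step, assuming the context is a flat JSON object as shown on the slide; `jsterm` is the custom function from the talk, while the helper names and the default field name `VALIDITY` are assumptions.

```python
import base64
import json


def encode_context(context: dict) -> str:
    """Serialize the vehicle context to compact JSON and Base64-encode it."""
    raw = json.dumps(context, separators=(",", ":")).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def decode_context(encoded: str) -> dict:
    """Inverse of encode_context, as the server side would have to do."""
    return json.loads(base64.b64decode(encoded))


def validity_filter(context: dict, field: str = "VALIDITY") -> str:
    """Build the fq value: keep documents whose term evaluates to exactly 1."""
    local_params = "{!frange l=1 u=1 incl=true incu=true cache=false cost=500}"
    return f"{local_params}jsterm({field},{encode_context(context)})"


vehicle = {"BRAND": "BMW PKW", "E_SERIES": "F10", "TRANSMISSION": "MECH"}
fq = validity_filter(vehicle)
print(fq)
```

With `cache=false` and a cost of 100 or more, Solr runs a filter that supports it as a post filter, so the comparatively expensive term evaluation is applied only to documents that already passed the cheaper filters.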

How often do we load data? How do we ensure data consistency?

A traditional approach using a DWH and ETL: too inflexible, heavyweight and expensive.
o ETL jobs would usually be implemented with Informatica.
o Significant business logic is required, depending on the source database.
[Figure: systems A, B and C, files and databases feed a data warehouse through ETL jobs; a further ETL step loads AIR Solr from the data warehouse.]

Flexible and lightweight ETL combined with Continuous Delivery and DevOps.
[Figure: start, build, run. Components shown: a developer building and deploying the AIR Loader with Apache Maven and a Nexus repository; operations working with the Jenkins master and a Jenkins slave; the AIR Loader extracting from data sources A to n and loading an Apache Solr index on the AIR Loader slave; that index replicated to the Apache Solr index of AIR Search.]


Apache Solr has become a powerful tool for data analytics applications. Be creative.
Our next big project using Apache Solr is already on its way.
o High-performance application to predict and calculate the bill of materials for all required parts and orders.
o Apache Solr as a compressed, scalable and high-performance time series database. FOSDEM '15, Florian Lautenschlager, QAware GmbH
o Leveraging the Power of SOLR and SPARK. Apache: Big Data 2015, Johannes Weigend, QAware GmbH

Business intelligence is about asking the right questions about your data.

And with Apache Solr you can search and find the answers you are looking for.

Mario-Leander Reimer
Chief Technologist, QAware GmbH
https://twitter.com/leanderreimer/
https://slideshare.net/marioleanderreimer/
https://speakerdeck.com/lreimer/