How Lucene Powers LinkedIn Segmentation & Targeting Platform



How Lucene Powers LinkedIn Segmentation & Targeting Platform
Lucene/SOLR Revolution EU, November 2013
Hien Luu, Raj Rangaswamy

About Us
- Hien Luu
- Rajasekaran Rangaswamy

Agenda
- Little bit about LinkedIn
- Segmentation & Targeting Platform Overview
- How Lucene powers Segmentation & Targeting Platform
- Q&A

Our Vision
Create economic opportunity for every professional in the world.

Our Mission
Connect the world's professionals to make them more productive and successful.

Members First!

The world's largest professional network
- Over 65% of members are now international
- >30M
- >90% of Fortune 100 companies use LinkedIn Talent Solutions to hire
- >3M Company Pages
- 19 languages
- >5.7B professional searches in 2012

Other Company Facts
- Headquartered in Mountain View, Calif., with offices around the world
- LinkedIn has ~4200 full-time employees located around the world
Source: http://press.linkedin.com/about

Segmentation & Targeting

Segmentation & Targeting Attribute types Bhaskar Ghosh

Segmentation & Targeting

1. Create attributes: Name, Email, State, Occupation, etc.

2. Attributes added to table:

   Name        Email            State       Occupation
   John Smith  jsmith@blah.com  California  Engineer
   Jane Smith  smithj@mail.com  Nevada      HR Manager
   Jane Doe    jdoe@email.com   California  Engineer

3. Create target segment: California, Engineer

4. Export list and send to vendor:

   Name        Email            State       Occupation
   John Smith  jsmith@blah.com  California  Engineer
   Jane Doe    jdoe@email.com   California  Engineer
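Steps 3 and 4 above amount to filtering the attribute table on equality criteria. The following is a minimal Python sketch of that idea, assuming the table fits in memory as a list of dicts; `members` and `build_segment` are illustrative names and not part of the platform described in the talk.

```python
# Hypothetical in-memory stand-in for the attribute table shown on the slide.
members = [
    {"name": "John Smith", "email": "jsmith@blah.com", "state": "California", "occupation": "Engineer"},
    {"name": "Jane Smith", "email": "smithj@mail.com", "state": "Nevada", "occupation": "HR Manager"},
    {"name": "Jane Doe", "email": "jdoe@email.com", "state": "California", "occupation": "Engineer"},
]


def build_segment(rows, **criteria):
    """Return the rows whose attributes equal every given criterion."""
    return [row for row in rows if all(row.get(k) == v for k, v in criteria.items())]


# Step 3: create the target segment (California, Engineer).
segment = build_segment(members, state="California", occupation="Engineer")

# Step 4: export the list.
for row in segment:
    print(row["name"], row["email"])
# prints:
# John Smith jsmith@blah.com
# Jane Doe jdoe@email.com
```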

Segmentation & Targeting: Business definition
- The business wants to launch new campaigns often
- The business wants to specify targeting criteria using an arbitrary set of attributes
- Attributes need to be computed to fulfill the targeting criteria
- The attribute data resides on Hadoop or TD
- The business is most comfortable with a SQL-like language

Segmentation & Targeting
- Attribute Computation Engine
- Attribute Serving Engine

Segmentation & Targeting: Attribute Computation Engine
- Self-service
- Attribute consolidation
- Support various data sources
- Attribute availability

Segmentation & Targeting: Attribute computation
[Figure: attribute computation over data sources at PB and TB scale; labels ~238M and ~440]

Segmentation & Targeting: Attribute Serving Engine
- Self-service
- Build segments
- Attribute predicate expression
- Build lists

Segmentation & Targeting: Serving Engine
[Figure: the Serving Engine applies count, filter, sum, and complex expressions to the LinkedIn Member Attribute table; labels ~238M and ~440]

LinkedIn Segmentation & Targeting Platform
- Who are the job seekers?
- Who are the LinkedIn Talent Solution prospects in Europe?
- Who are the North American recruiters that don't work for a competitor?

LinkedIn Segmentation & Targeting Platform Complex tree-like attribute predicate expressions
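A tree-like predicate expression can be pictured as nested boolean nodes over attribute comparisons. The sketch below evaluates such a tree against one member at a time. The node layout (`op`, `children`, `child`, `attr`, `value`, `values`), the attribute names, and the company names are all assumptions made for illustration; the talk does not show the platform's real expression format.

```python
def evaluate(node, member):
    """Recursively evaluate a predicate tree against one member's attributes."""
    op = node["op"]
    if op == "and":
        return all(evaluate(child, member) for child in node["children"])
    if op == "or":
        return any(evaluate(child, member) for child in node["children"])
    if op == "not":
        return not evaluate(node["child"], member)
    if op == "eq":
        return member.get(node["attr"]) == node["value"]
    if op == "in":
        return member.get(node["attr"]) in node["values"]
    raise ValueError("unknown operator: %r" % op)


# "North American recruiters that don't work for a competitor",
# with hypothetical attribute names and values.
predicate = {
    "op": "and",
    "children": [
        {"op": "in", "attr": "region", "values": ["US", "Canada", "Mexico"]},
        {"op": "eq", "attr": "occupation", "value": "Recruiter"},
        {"op": "not", "child": {"op": "in", "attr": "employer", "values": ["ExampleCo"]}},
    ],
}

candidates = [
    {"name": "A", "region": "US", "occupation": "Recruiter", "employer": "OtherCo"},
    {"name": "B", "region": "US", "occupation": "Recruiter", "employer": "ExampleCo"},
    {"name": "C", "region": "Germany", "occupation": "Recruiter", "employer": "OtherCo"},
]

matches = [m["name"] for m in candidates if evaluate(predicate, m)]
print(matches)  # prints ['A']
```

Evaluating row by row is only practical for small tables; the serving slides describe running such expressions against inverted indexes instead.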

Agenda
- Architecture
- Indexer Architecture
- Serving Architecture
- Load Balanced Model
- Next Steps - Distributed Model
- DocValues
- Lessons Learnt
- Why not use an existing solution?

Architecture
[Figure: architecture diagram. Labels, in order: Attribute Serving Engine, Attribute Indexing, Attribute Serving Engine, Attribute Computation Engine, Attribute Creation Engine, Attribute Materialization Engine, Attribute Metastore, Data Storage Layer]

Indexer
[Figure: indexer data flow]
- Inputs: attribute definitions from the MySQL attribute store; Avro data in HDFS
- Hadoop Indexer MR job:
  - Mapper: K => AvroKey<GenericRecord>, V => AvroValue<NullWritable>
  - Reducer: K => NullWritable, V => LuceneDocumentWrapper
  - LuceneOutputFormat and its RecordWriter turn each LuceneDocumentWrapper into a Document
- Output: shard 1, shard 2, ..., shard n
- Index Merger, then the index on the web servers
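The indexer flow (split records into shards, build an index per shard, then merge) can be sketched without Hadoop or Lucene. The code below is a simplified stand-in: partitioning by `member_id % num_shards`, the `(attribute, value)` term layout, and the function names are all assumptions for illustration, not details from the talk.

```python
from collections import defaultdict


def partition(records, num_shards):
    """Map step: route each record to a shard (assumed: member_id modulo shard count)."""
    shards = defaultdict(list)
    for record in records:
        shards[record["member_id"] % num_shards].append(record)
    return shards


def build_index(records):
    """Reduce step: build a tiny inverted index, (attribute, value) -> set of member IDs."""
    index = defaultdict(set)
    for record in records:
        for attr, value in record.items():
            if attr != "member_id":
                index[(attr, value)].add(record["member_id"])
    return index


def merge(indexes):
    """Merge step: union the postings of every shard index into one index."""
    merged = defaultdict(set)
    for index in indexes:
        for term, postings in index.items():
            merged[term] |= postings
    return merged


records = [
    {"member_id": 1, "state": "California", "occupation": "Engineer"},
    {"member_id": 2, "state": "Nevada", "occupation": "HR Manager"},
    {"member_id": 3, "state": "California", "occupation": "Engineer"},
    {"member_id": 4, "state": "California", "occupation": "Recruiter"},
]

shard_indexes = [build_index(recs) for recs in partition(records, 2).values()]
merged = merge(shard_indexes)
print(sorted(merged[("state", "California")]))  # prints [1, 3, 4]
```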

Serving
[Figure: serving data flow. JSON Predicate Expression -> JSON Lucene Query Parser -> inverted indexes -> Segment & List]
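One way to picture the JSON-to-query step is a small translator that turns a JSON predicate tree into a query string in Lucene's classic query parser syntax (`field:"value"`, `AND`, `OR`, `NOT`, parentheses). This is a sketch only: the JSON layout and the function name `to_lucene` are assumptions, and the talk's actual parser may well build query objects directly rather than strings.

```python
import json


def to_lucene(node):
    """Translate a (hypothetical) JSON predicate tree into a Lucene query string."""
    op = node["op"]
    if op in ("and", "or"):
        joiner = " " + op.upper() + " "
        return "(" + joiner.join(to_lucene(child) for child in node["children"]) + ")"
    if op == "not":
        return "NOT " + to_lucene(node["child"])
    if op == "eq":
        # Escape backslashes and double quotes so the value stays one quoted term.
        value = str(node["value"]).replace("\\", "\\\\").replace('"', '\\"')
        return '{}:"{}"'.format(node["attr"], value)
    raise ValueError("unknown operator: %r" % op)


request = """
{"op": "and", "children": [
  {"op": "eq", "attr": "state", "value": "California"},
  {"op": "not", "child": {"op": "eq", "attr": "occupation", "value": "HR Manager"}}
]}
"""

query = to_lucene(json.loads(request))
print(query)  # prints (state:"California" AND NOT occupation:"HR Manager")
```

Note that a purely negative query matches nothing in Lucene, so a `not` node is only useful alongside at least one positive clause, as in the example.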

Serving: Load Balanced Model
[Figure: HTTP Request -> Load Balancer -> Web Server 1, Web Server 2, ..., Web Server n; Shard 1, Shard 2, ..., Shard n on a Shared Drive]

Serving: Load Balanced Model
But wait... Is load balancing alone good enough? What about distribution and failover?

Next Steps - Distributed Model
Apache Helix (http://helix.incubator.apache.org/)
- A generic cluster management framework
- Used to manage partitioned and replicated resources in distributed systems
- Built on top of ZooKeeper; hides the complexity of ZK primitives
- Provides distributed features such as leader election, two-phase commit, etc. via a state machine model

Next Steps - Distributed Model
[Figure: HTTP Request -> Load Balancer -> Scatter Gather -> Web Servers 1-3]
- Web Server 1: Shard 1 active, Shard 2 standby
- Web Server 2: Shard 2 active, Shard 3 standby
- Web Server 3: Shard 3 active, Shard 1 standby

Next Steps - Distributed Model (after a failure)
[Figure: HTTP Request -> Load Balancer -> Scatter Gather -> Web Servers 1-3]
- Web Server 1: Shard 1 active, Shard 2 standby
- Web Server 2: Shard 2 active, Shard 3 active
- Web Server 3: Shard 3 failure, Shard 1 failure
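The active/standby layout in the two diagrams can be modeled as a small placement map in which a server failure promotes the surviving standby replica. The sketch below is plain Python and does not use the Helix API; `placement`, `fail_server`, `gather_count`, and the per-shard counts are hypothetical names and numbers chosen for illustration.

```python
# Shard placement from the first diagram: each shard has one active and one standby replica.
placement = {
    "shard1": {"active": "web1", "standby": "web3"},
    "shard2": {"active": "web2", "standby": "web1"},
    "shard3": {"active": "web3", "standby": "web2"},
}


def fail_server(placement, server):
    """Return a new placement after `server` fails, promoting standbys where needed."""
    updated = {}
    for shard, roles in placement.items():
        active, standby = roles["active"], roles["standby"]
        if active == server:
            active, standby = standby, None  # promote the standby replica
        elif standby == server:
            standby = None  # the standby copy is lost; the active one keeps serving
        updated[shard] = {"active": active, "standby": standby}
    return updated


def gather_count(placement, shard_counts):
    """Scatter-gather: sum per-shard counts, requiring an active replica for every shard."""
    total = 0
    for shard, roles in placement.items():
        if roles["active"] is None:
            raise RuntimeError("no active replica for " + shard)
        total += shard_counts[shard]
    return total


after = fail_server(placement, "web3")
print(after["shard3"])  # prints {'active': 'web2', 'standby': None}
print(gather_count(after, {"shard1": 10, "shard2": 7, "shard3": 5}))  # prints 22
```

After the failure every shard still has an active replica, so a scatter-gather query can still cover the full index, which matches the second diagram.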

DocValues: Use Case
- Once segments are built, users want to forecast a target revenue projection for the campaigns that they want to run
- Campaigns can be run on various revenue models
- This involves adding per-member propensity scores and dollar amounts

DocValues: Why not Stored Fields?
- Stored fields have one indirection per document, resulting in two disk seeks per document
- The performance cost quickly adds up when fetching millions of documents
- Lookup path: .fdx (document ID -> fetch file pointer to field data), then .fdt (scan by ID until the field is found)

DocValues: Why not Field Cache?
- Is memory resident
- Works fine when there is enough memory
- But keeping millions of un-inverted values in memory is impossible
- Additional cost to parse values (from String and to String)

DocValues
- Dense column-based storage (1 value per document, 1 column per field and segment)
- Accepts primitives
- No conversion from/to String needed
- Loads 80x-100x faster than building a FieldCache
- All the work is done during indexing
- DocValue fields can be indexed and stored too
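The difference between per-document stored fields and a dense per-field column can be illustrated without Lucene. In the sketch below, `stored_fields` stands in for row-oriented storage and the two `array` objects stand in for DocValues-style columns indexed by document ID. The formula (propensity times dollar amount, summed over the segment) is an assumed revenue model for illustration; the talk does not give one.

```python
from array import array

# Row-oriented layout: one record per document, values reached through the record.
stored_fields = [
    {"name": "John Smith", "propensity": 0.5, "dollars": 100.0},
    {"name": "Jane Smith", "propensity": 0.25, "dollars": 200.0},
    {"name": "Jane Doe", "propensity": 1.0, "dollars": 50.0},
]

# Column-oriented layout: one dense array of primitives per field,
# built once at "index time" and addressed directly by document ID.
propensity = array("d", (doc["propensity"] for doc in stored_fields))
dollars = array("d", (doc["dollars"] for doc in stored_fields))


def projected_revenue(doc_ids):
    """Sum propensity * dollars over the matching documents (assumed formula)."""
    return sum(propensity[d] * dollars[d] for d in doc_ids)


segment_doc_ids = [0, 2]  # document IDs returned by a segment query
print(projected_revenue(segment_doc_ids))  # prints 100.0
```

Because each column holds only primitives for one field, a forecast touches just the two arrays it needs and never parses strings or loads unrelated fields.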

Lessons Learnt

Indexing
- Reuse index writers, field and document instances
- Create many partitions and merge them in a different process
- Rebuild (bootstrap) the entire index if possible
- Use partial updates with caution
- Analyze the index

Serving
- Reuse a single instance of IndexSearcher
- Limit usage of stored fields and term vectors
- Plan for load balancing and failover
- Cache term frequencies
- Use different machines for serving and indexing

Why not use an existing solution?

First existing solution:
- Doesn't allow dynamic schema
- Difficult to bootstrap indexes built in Hadoop
- Indexing elevates query latency

Second existing solution:
- Doesn't allow dynamic schema
- Difficult to bootstrap indexes built in Hadoop
- Larger memory overhead
- Comparatively slow

Questions? More info: data.linkedin.com