How Lucene Powers LinkedIn Segmentation & Targeting Platform

1 How Lucene Powers LinkedIn Segmentation & Targeting Platform Lucene/SOLR Revolution EU, November 2013 Hien Luu, Raj Rangaswamy

2 About Us: Hien Luu, Rajasekaran Rangaswamy

3 Agenda
- A little bit about LinkedIn
- Segmentation & Targeting Platform overview
- How Lucene powers the Segmentation & Targeting Platform
- Q&A

4 Our Vision: Create economic opportunity for every professional in the world. Our Mission: Connect the world's professionals to make them more productive and successful. Members First!

5 The world's largest professional network
- Over 65% of members are now international
- >30M
- >90% of Fortune 100 companies use LinkedIn Talent Solutions to hire
- >3M Company Pages
- 19 languages
- >5.7B professional searches in 2012

6 Other Company Facts
- Headquartered in Mountain View, Calif., with offices around the world
- LinkedIn has ~4200 full-time employees* located around the world
* Source:

7 Segmentation & Targeting

8 Segmentation & Targeting

9 Segmentation & Targeting Attribute types Bhaskar Ghosh

10 Segmentation & Targeting
1. Create attributes: Name, State, Occupation, etc.
2. Attributes added to table:
   Name        State       Occupation
   John Smith  California  Engineer
   Jane Smith  Nevada      HR Manager
   Jane Doe    California  Engineer
3. Create target segment: California, Engineer
4. Export list and send to vendor:
   Name        State       Occupation
   John Smith  California  Engineer
   Jane Doe    California  Engineer
LinkedIn Confidential 2013 All Rights Reserved
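
The four steps on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not the platform's implementation: the `Member` class and the `build_segment` helper are hypothetical names, and the rows are the three example members from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    """One row of the example attribute table (hypothetical type)."""
    name: str
    state: str
    occupation: str


# Step 2: attributes added to a table (rows taken from the slide).
MEMBERS = [
    Member("John Smith", "California", "Engineer"),
    Member("Jane Smith", "Nevada", "HR Manager"),
    Member("Jane Doe", "California", "Engineer"),
]


def build_segment(members, **criteria):
    """Step 3: keep members whose attributes equal every criterion."""
    return [
        m for m in members
        if all(getattr(m, attr) == value for attr, value in criteria.items())
    ]


# Step 3: target segment "California, Engineer".
segment = build_segment(MEMBERS, state="California", occupation="Engineer")

# Step 4: export the list of names.
export_list = [m.name for m in segment]
print(export_list)
```

A linear scan like this is fine for three rows; the later slides explain why the real table (hundreds of millions of members) is served from an inverted index instead.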

11 Segmentation & Targeting: Business definition
- The business would like to launch new campaigns often
- The business would like to specify targeting criteria using an arbitrary set of attributes
- Attributes need to be computed to fulfill the targeting criteria
- The attribute data resides on Hadoop or TD
- The business is most comfortable with a SQL-like language

12 Segmentation & Targeting: Attribute Computation Engine, Attribute Serving Engine

13 Segmentation & Targeting: Attribute Computation Engine
- Self-service
- Attribute consolidation
- Support for various data sources
- Attribute availability

14 Segmentation & Targeting: Attribute computation (figure labels: PB, TB, TB, ~238M, ~440)

15 Segmentation & Targeting: Attribute Serving Engine
- Self-service
- Build segments
- Attribute predicate expression
- Build lists

16 Segmentation & Targeting: Serving Engine (figure: count, filter, sum, and complex expressions over the LinkedIn Member Attribute table; figure labels: ~238M, ~440)

17 LinkedIn Segmentation & Targeting Platform
- Who are the job seekers?
- Who are the LinkedIn Talent Solution prospects in Europe?
- Who are the North American recruiters who don't work for a competitor?

18 LinkedIn Segmentation & Targeting Platform Complex tree-like attribute predicate expressions

19 Agenda
- Architecture
- Indexer Architecture
- Serving Architecture
- Load Balanced Model
- Next Steps - Distributed Model
- DocValues
- Lessons Learnt
- Why not use an existing solution?

20 Architecture
- Attribute Serving Engine: Attribute Indexing, Attribute Serving Engine
- Attribute Computation Engine: Attribute Creation Engine, Attribute Materialization Engine, Attribute Metastore
- Data Storage Layer

21 Indexer
- Inputs: Attribute Definitions (mysql attribute store) and Avro data in HDFS
- Hadoop Indexer MR:
  - Mapper: K => AvroKey<GenericRecord>, V => AvroValue<NullWritable>
  - Reducer: K => NullWritable, V => LuceneDocumentWrapper
  - LuceneOutputFormat and RecordWriter: LuceneDocumentWrapper, Document
- Outputs: shard 1, shard 2, ..., shard n
- Index Merger
- Web Servers: Index
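
The partition-then-merge flow of the indexer can be illustrated with a small pure-Python sketch. It does not use Hadoop or Lucene: `map_phase`, `reduce_phase`, and `merge_indexes` are hypothetical stand-ins for the mapper, the reducer, and the Index Merger, the records are made up, and partitioning by `id % num_partitions` is an assumption (the slide does not say how records are assigned to shards).

```python
from collections import defaultdict

# Made-up attribute records standing in for the Avro data in HDFS.
RECORDS = [
    {"id": 1, "state": "California", "occupation": "Engineer"},
    {"id": 2, "state": "Nevada", "occupation": "HR Manager"},
    {"id": 3, "state": "California", "occupation": "Engineer"},
    {"id": 4, "state": "Nevada", "occupation": "Engineer"},
]


def map_phase(records, num_partitions):
    """Assign each record to a partition (assumed rule: id modulo count)."""
    partitions = defaultdict(list)
    for record in records:
        partitions[record["id"] % num_partitions].append(record)
    return partitions


def reduce_phase(partition_records):
    """Build a small inverted index (term -> sorted doc ids) for one partition."""
    index = defaultdict(list)
    for record in sorted(partition_records, key=lambda r: r["id"]):
        for field, value in record.items():
            if field != "id":
                index[f"{field}:{value}"].append(record["id"])
    return dict(index)


def merge_indexes(indexes):
    """Union the postings of several partial indexes into one index."""
    merged = defaultdict(set)
    for index in indexes:
        for term, postings in index.items():
            merged[term].update(postings)
    return {term: sorted(ids) for term, ids in merged.items()}


partitions = map_phase(RECORDS, num_partitions=2)
partial_indexes = [reduce_phase(recs) for recs in partitions.values()]
merged_index = merge_indexes(partial_indexes)
print(merged_index["occupation:Engineer"])
```

Building many small partial indexes and merging them in a separate step mirrors the "create many partitions and merge them in a different process" advice in the Lessons Learnt slide.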

22 Serving (diagram components): JSON Predicate Expression; JSON Lucene Query Parser; Segment & List; Inverted Index (x3)
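
To make the serving path concrete, here is a minimal sketch of evaluating a tree-like JSON predicate expression against an inverted index using set operations. The JSON shape (`op`, `field`, `value`, `children`) is invented for this example and is not the platform's actual format; the index contents are made up, and the real system parses the expression into a Lucene query rather than evaluating Python sets.

```python
import json

# Made-up inverted index: term -> set of member (document) ids.
INDEX = {
    "state:California": {1, 3},
    "state:Nevada": {2, 4},
    "occupation:Engineer": {1, 3, 4},
    "occupation:HR Manager": {2},
}
ALL_DOCS = {1, 2, 3, 4}


def evaluate(node, index, all_docs):
    """Recursively evaluate a predicate tree; returns matching doc ids.

    Assumes every "and"/"or"/"not" node has at least one child.
    """
    op = node["op"]
    if op == "term":
        return set(index.get(f'{node["field"]}:{node["value"]}', set()))
    children = [evaluate(child, index, all_docs) for child in node["children"]]
    if op == "and":
        return set.intersection(*children)
    if op == "or":
        return set.union(*children)
    if op == "not":
        return all_docs - set.union(*children)
    raise ValueError(f"unknown operator: {op}")


# "Engineers who are not in Nevada", written in the hypothetical JSON format.
expression = json.loads("""
{"op": "and", "children": [
    {"op": "term", "field": "occupation", "value": "Engineer"},
    {"op": "not", "children": [
        {"op": "term", "field": "state", "value": "Nevada"}
    ]}
]}
""")

segment = evaluate(expression, INDEX, ALL_DOCS)
print(sorted(segment), len(segment))
```

The same tree walk yields both the list (the ids) and the count, which corresponds to the "Segment & List" output on the slide.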

24 Serving: Load Balanced Model
HTTP Request -> Load Balancer -> Web Server 1, Web Server 2, ..., Web Server n
Shard 1, Shard 2, ..., Shard n on a Shared Drive

25 Serving: Load Balanced Model. But wait... Is load balancing alone good enough? What about distribution and failover?

27 Next Steps - Distributed Model
- A generic cluster management framework
- Used to manage partitioned and replicated resources in distributed systems
- Built on top of ZooKeeper; hides the complexity of ZK primitives
- Provides distributed features such as leader election, two-phase commit, etc. via a state machine model

28 Next Steps - Distributed Model
HTTP Request -> Load Balancer -> Scatter Gather -> Web Servers 1-3
- Web Server 1: Shard 1 active, Shard 2 standby
- Web Server 2: Shard 2 active, Shard 3 standby
- Web Server 3: Shard 3 active, Shard 1 standby

29 Next Steps - Distributed Model
HTTP Request -> Load Balancer -> Scatter Gather -> Web Servers 1-3
- Web Server 1: Shard 1 active, Shard 2 standby
- Web Server 2: Shard 2 active, Shard 3 active
- Web Server 3: Shard 3 failure, Shard 1 failure
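
The failover shown across these two slides can be sketched as a small state transition: when a web server fails, each shard it served as active is promoted from standby on a surviving server, and scatter-gather keeps routing every shard to its active replica. This is a toy model, not the cluster management framework's API; `fail_server`, `scatter_gather`, the server names, and the per-shard counts are all hypothetical.

```python
# Replica placement before any failure (slide 28).
PLACEMENT = {
    "web1": {"shard1": "active", "shard2": "standby"},
    "web2": {"shard2": "active", "shard3": "standby"},
    "web3": {"shard3": "active", "shard1": "standby"},
}

# Hypothetical number of matching members per shard for some query.
SHARD_COUNTS = {"shard1": 10, "shard2": 20, "shard3": 30}


def fail_server(placement, failed):
    """Return a new placement without `failed`, promoting standbys as needed."""
    survivors = {srv: dict(roles) for srv, roles in placement.items() if srv != failed}
    for shard, role in placement[failed].items():
        if role != "active":
            continue  # a lost standby needs no promotion
        for roles in survivors.values():
            if roles.get(shard) == "standby":
                roles[shard] = "active"
                break
    return survivors


def scatter_gather(placement, shard_counts):
    """Route each shard to its active server and sum the partial counts."""
    routing, total = {}, 0
    for shard, count in shard_counts.items():
        owner = next(
            (srv for srv, roles in placement.items() if roles.get(shard) == "active"),
            None,
        )
        if owner is None:
            raise RuntimeError(f"no active replica for {shard}")
        routing[shard] = owner
        total += count
    return routing, total


routing_before, total_before = scatter_gather(PLACEMENT, SHARD_COUNTS)
after_failure = fail_server(PLACEMENT, "web3")
routing_after, total_after = scatter_gather(after_failure, SHARD_COUNTS)
print(routing_after, total_after)
```

After `web3` fails, the placement matches slide 29: Web Server 2 holds both Shard 2 and Shard 3 as active, and the gathered total is unchanged because every shard still has one active replica.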

31 DocValues: Use Case
- Once segments are built, users want to forecast: see a target revenue projection for the campaigns they want to run
- Campaigns can be run on various revenue models
- This involves adding per-member propensity scores and dollar amounts

32 DocValues: Why not Stored Fields?
- Stored fields have one indirection per document, resulting in two disk seeks per document
- The performance cost quickly adds up when fetching millions of documents
Diagram: Document ID -> .fdx (fetch file pointer to field data) -> .fdt (scan by id until field is found)

33 DocValues: Why not Field Cache?
- It is memory resident
- Works fine when there is enough memory
- But keeping millions of un-inverted values in memory is impossible
- Additional cost to parse values (from String and to String)

34 DocValues
- Dense column-based storage (1 value per document, and 1 column per field and segment)
- Accepts primitives
- No conversion from/to String needed
- Loads 80x-100x faster than building a FieldCache
- All the work is done during indexing
- DocValue fields can be indexed and stored too
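
The difference between row-oriented stored fields and a dense typed column can be illustrated without Lucene. In the sketch below, the `STORED` dictionary mimics per-document string fields that must be looked up and parsed one document at a time, while the `array` columns mimic one primitive value per document, addressed directly by document id. The values are made up, and treating projected revenue as the sum of propensity times dollar amount is an assumption; the slides only say that per-member propensity scores and dollar amounts are involved.

```python
from array import array

# Row-oriented, string-valued fields (stand-in for stored fields).
STORED = {
    0: {"propensity": "0.25", "amount": "200.0"},
    1: {"propensity": "0.5", "amount": "100.0"},
    2: {"propensity": "0.125", "amount": "400.0"},
    3: {"propensity": "0.75", "amount": "80.0"},
}

# Column-oriented, primitive-valued storage (stand-in for DocValues):
# one dense column per field, one double per document id.
PROPENSITY = array("d", [0.25, 0.5, 0.125, 0.75])
AMOUNT = array("d", [200.0, 100.0, 400.0, 80.0])


def projected_revenue_stored(doc_ids):
    """Fetch each document's fields, then parse the strings to numbers."""
    total = 0.0
    for doc_id in doc_ids:
        fields = STORED[doc_id]  # one lookup per document
        total += float(fields["propensity"]) * float(fields["amount"])
    return total


def projected_revenue_columns(doc_ids):
    """Read primitives straight from the columns; no parsing required."""
    return sum(PROPENSITY[d] * AMOUNT[d] for d in doc_ids)


segment_doc_ids = [0, 2, 3]  # document ids of a previously built segment
revenue = projected_revenue_columns(segment_doc_ids)
print(revenue)
```

Both functions compute the same number; the columnar version simply avoids the per-document indirection and the String conversions that the two preceding slides call out.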

36 Lessons Learnt
Indexing
- Reuse index writers, field and document instances
- Create many partitions and merge them in a different process
- Rebuild (bootstrap) the entire index if possible
- Use partial updates with caution
- Analyze the index
Serving
- Reuse a single instance of IndexSearcher
- Limit usage of stored fields and term vectors
- Plan for load balancing and failover
- Cache term frequencies
- Use different machines for serving and indexing

38 Why not use an existing solution?
First existing solution considered:
- Doesn't allow dynamic schema
- Difficult to bootstrap indexes built in Hadoop
- Indexing elevates query latency
Second existing solution considered:
- Doesn't allow dynamic schema
- Difficult to bootstrap indexes built in Hadoop
- Larger memory overhead
- Comparatively slow

39 Questions? More info: data.linkedin.com
