How Lucene Powers LinkedIn Segmentation & Targeting Platform Lucene/SOLR Revolution EU, November 2013 Hien Luu, Raj Rangaswamy
About Us: Hien Luu, Rajasekaran Rangaswamy
Agenda: A little bit about LinkedIn; Segmentation & Targeting Platform overview; How Lucene powers the Segmentation & Targeting Platform; Q&A
Our Vision: Create economic opportunity for every professional in the world. Our Mission: Connect the world's professionals to make them more productive and successful. Members First!
The world's largest professional network. Over 65% of members are now international. >30M. >90% of Fortune 100 companies use LinkedIn Talent Solutions to hire. >3M Company Pages. 19 languages. >5.7B professional searches in 2012.
Other Company Facts: Headquartered in Mountain View, Calif., with offices around the world. LinkedIn has ~4,200 full-time employees located around the world. Source: http://press.linkedin.com/about
Segmentation & Targeting
Segmentation & Targeting Attribute types Bhaskar Ghosh
Segmentation & Targeting
1. Create attributes: Name, Email, State, Occupation, etc.
2. Attributes added to table:
   Name        Email            State       Occupation
   John Smith  jsmith@blah.com  California  Engineer
   Jane Smith  smithj@mail.com  Nevada      HR Manager
   Jane Doe    jdoe@email.com   California  Engineer
3. Create target segment: California, Engineer
   Name        Email            State       Occupation
   John Smith  jsmith@blah.com  California  Engineer
   Jane Doe    jdoe@email.com   California  Engineer
4. Export list & send to vendor
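The four steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not the platform's code: the `build_segment` and `export_list` helpers and the in-memory `MEMBERS` table are hypothetical names introduced here.

```python
from typing import Dict, List

# Hypothetical in-memory attribute table holding the rows from the example.
MEMBERS: List[Dict[str, str]] = [
    {"name": "John Smith", "email": "jsmith@blah.com",
     "state": "California", "occupation": "Engineer"},
    {"name": "Jane Smith", "email": "smithj@mail.com",
     "state": "Nevada", "occupation": "HR Manager"},
    {"name": "Jane Doe", "email": "jdoe@email.com",
     "state": "California", "occupation": "Engineer"},
]


def build_segment(rows: List[Dict[str, str]], **criteria: str) -> List[Dict[str, str]]:
    """Keep only the rows whose attributes equal every targeting criterion."""
    return [row for row in rows
            if all(row.get(attr) == value for attr, value in criteria.items())]


def export_list(rows: List[Dict[str, str]], columns: List[str]) -> List[str]:
    """Render the segment as comma-separated lines, header first."""
    lines = [",".join(columns)]
    lines.extend(",".join(row[col] for col in columns) for row in rows)
    return lines


# Step 3: target segment "California, Engineer"; step 4: export the list.
segment = build_segment(MEMBERS, state="California", occupation="Engineer")
exported = export_list(segment, ["name", "email"])
```

At LinkedIn scale the same filter has to run over hundreds of millions of rows, which is why the rest of the talk moves this lookup into an inverted index.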
Segmentation & Targeting: Business definition. The business would like to launch new campaigns often. The business would like to specify targeting criteria using an arbitrary set of attributes. Attributes need to be computed to fulfill the targeting criteria. The attribute data resides on Hadoop or Teradata (TD). The business is most comfortable with a SQL-like language.
Segmentation & Targeting Attribute Computation Engine Attribute Serving Engine
Segmentation & Targeting Self-service Attribute consolidation Attribute Computation Engine Support various data sources Attribute availability
Segmentation & Targeting: Attribute computation. [Figure: PB- and TB-scale source data is computed into an attribute table; labels: ~238M, ~440]
Segmentation & Targeting Self-service Build segments Attribute Serving Engine Attribute predicate expression Build lists
Segmentation & Targeting: Serving Engine. [Figure: count, filter, sum, and complex expressions evaluated by the Serving Engine over the LinkedIn Member Attribute table; labels: ~238M, ~440]
LinkedIn Segmentation & Targeting Platform: Who are the job seekers? Who are the LinkedIn Talent Solution prospects in Europe? Who are the North American recruiters that don't work for a competitor?
LinkedIn Segmentation & Targeting Platform Complex tree-like attribute predicate expressions
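A tree-like predicate expression can be pictured as nested AND/OR/NOT nodes over attribute comparisons. The sketch below is a minimal, assumed representation (the node layout, the operator names, and the `matches` function are all hypothetical, not the platform's format) that evaluates such a tree against one member's attributes.

```python
from typing import Any, Dict


def matches(node: Dict[str, Any], member: Dict[str, Any]) -> bool:
    """Recursively evaluate a predicate tree against one member's attributes."""
    op = node["op"]
    if op == "and":
        return all(matches(child, member) for child in node["children"])
    if op == "or":
        return any(matches(child, member) for child in node["children"])
    if op == "not":
        return not matches(node["child"], member)
    if op == "eq":  # leaf: attribute equals a value
        return member.get(node["attr"]) == node["value"]
    if op == "in":  # leaf: attribute is one of several values
        return member.get(node["attr"]) in node["values"]
    raise ValueError(f"unknown operator: {op}")


# "North American recruiters that don't work for a competitor",
# with made-up attribute names and company values.
expression = {
    "op": "and",
    "children": [
        {"op": "eq", "attr": "region", "value": "North America"},
        {"op": "eq", "attr": "occupation", "value": "Recruiter"},
        {"op": "not", "child": {"op": "in", "attr": "company",
                                "values": ["Rival Inc"]}},
    ],
}

members = [
    {"name": "A", "region": "North America", "occupation": "Recruiter", "company": "Acme"},
    {"name": "B", "region": "North America", "occupation": "Recruiter", "company": "Rival Inc"},
    {"name": "C", "region": "Europe", "occupation": "Recruiter", "company": "Acme"},
]

selected = [m["name"] for m in members if matches(expression, m)]
```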
Agenda: Architecture (Indexer Architecture, Serving Architecture); Load Balanced Model; Next Steps - Distributed Model; DocValues; Lessons Learnt; Why not use an existing solution?
Architecture Attribute Serving Engine Attribute Indexing Attribute Serving Engine Attribute Computation Engine Attribute Creation Engine Attribute Materialization Engine Attribute Metastore Data Storage Layer
Indexer. [Figure: attribute definitions from the MySQL attribute store and Avro data in HDFS feed the Hadoop Indexer MR job. Mapper: K => AvroKey<GenericRecord>, V => AvroValue<NullWritable>. Reducer: K => NullWritable, V => LuceneDocumentWrapper. LuceneOutputFormat and its RecordWriter turn each LuceneDocumentWrapper into a Document, producing shard 1, shard 2, ..., shard n. The Index Merger combines the shards into the index used by the web servers.]
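The partition-then-merge shape of the indexer can be illustrated without Hadoop or Lucene. In this sketch each "shard" is a tiny inverted index built independently, and a separate merge step combines them. The modulo partitioning rule and every function name here are assumptions made for illustration; the slide does not say how members are assigned to shards.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Term = Tuple[str, str]  # (field, value)


def shard_for(member_id: int, num_shards: int) -> int:
    """Assumed partitioning rule: member ID modulo the shard count."""
    return member_id % num_shards


def build_shard_index(docs: Dict[int, Dict[str, str]]) -> Dict[Term, List[int]]:
    """Invert one partition: (field, value) -> sorted list of member IDs."""
    index: Dict[Term, List[int]] = defaultdict(list)
    for member_id in sorted(docs):
        for field, value in docs[member_id].items():
            index[(field, value)].append(member_id)
    return dict(index)


def merge_indexes(shards: List[Dict[Term, List[int]]]) -> Dict[Term, List[int]]:
    """Merge per-shard posting lists into a single index."""
    merged: Dict[Term, List[int]] = defaultdict(list)
    for shard in shards:
        for term, postings in shard.items():
            merged[term].extend(postings)
    return {term: sorted(postings) for term, postings in merged.items()}


docs = {
    1: {"state": "California", "occupation": "Engineer"},
    2: {"state": "Nevada", "occupation": "HR Manager"},
    3: {"state": "California", "occupation": "Engineer"},
}

num_shards = 2
partitions: List[Dict[int, Dict[str, str]]] = [{} for _ in range(num_shards)]
for member_id, attrs in docs.items():
    partitions[shard_for(member_id, num_shards)][member_id] = attrs

shard_indexes = [build_shard_index(p) for p in partitions]  # the "reducers"
merged = merge_indexes(shard_indexes)                        # the "index merger"
```

Building shards independently is what lets the work run in parallel; the merge can then happen in a different process, as the Lessons Learnt slide recommends.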
Serving. [Figure: a JSON predicate expression is passed to the JSON Lucene Query Parser, which runs against the inverted indexes to produce segments and lists.]
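To make the JSON-to-query step concrete, the sketch below turns a small JSON predicate expression into a string in Lucene's classic query parser syntax (`field:"value"`, `AND`, `OR`, `NOT`, parentheses). The JSON layout and the `to_lucene` function are hypothetical; the slides do not show the real expression format, and the actual parser presumably builds Lucene `Query` objects directly rather than strings.

```python
import json
from typing import Any, Dict


def quote(value: str) -> str:
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_lucene(node: Dict[str, Any]) -> str:
    """Translate one node of the (assumed) JSON expression into query syntax."""
    op = node["op"]
    if op == "term":
        return f'{node["field"]}:{quote(node["value"])}'
    if op in ("and", "or"):
        joined = f" {op.upper()} ".join(to_lucene(c) for c in node["clauses"])
        return f"({joined})"
    if op == "not":
        return f"NOT {to_lucene(node['clause'])}"
    raise ValueError(f"unknown operator: {op}")


expression_json = """
{"op": "and", "clauses": [
    {"op": "term", "field": "region", "value": "north america"},
    {"op": "term", "field": "occupation", "value": "recruiter"},
    {"op": "not", "clause": {"op": "term", "field": "company", "value": "acme"}}
]}
"""

query_string = to_lucene(json.loads(expression_json))
print(query_string)
# (region:"north america" AND occupation:"recruiter" AND NOT company:"acme")
```

Note that in Lucene a purely negative query matches nothing, so a `not` node is only useful alongside at least one positive clause, as in the example.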
Serving: Load Balanced Model. [Figure: HTTP requests pass through a load balancer to Web Server 1, Web Server 2, ..., Web Server n; Shard 1, Shard 2, ..., Shard n sit on a shared drive.]
Serving: Load Balanced Model. But wait... Is load balancing alone good enough? What about distribution and failover?
Next Steps - Distributed Model: Apache Helix. A generic cluster management framework. Used to manage partitioned and replicated resources in distributed systems. Built on top of ZooKeeper, hiding the complexity of ZK primitives. Provides distributed features such as leader election and two-phase commit via a state machine model. http://helix.incubator.apache.org/
Next Steps - Distributed Model. [Figure: HTTP requests pass through a load balancer to a scatter-gather layer over three web servers. Web Server 1: Shard 1 active, Shard 2 standby. Web Server 2: Shard 2 active, Shard 3 standby. Web Server 3: Shard 3 active, Shard 1 standby.]
Next Steps - Distributed Model. [Figure: the same layout after Web Server 3 fails. Web Server 1: Shard 1 active, Shard 2 standby. Web Server 2: Shard 2 active, Shard 3 now active. Web Server 3: Shard 3 failure, Shard 1 failure.]
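The scatter-gather and failover behaviour in the two diagrams can be sketched as follows. This is a hand-rolled illustration, not how Helix works internally: the replica table, the `scatter_gather` function, and the per-shard counts are all made up, and the shards are queried sequentially here where a real implementation would fan out in parallel.

```python
from typing import Callable, Dict, List, Set, Tuple

# For each shard: servers holding a replica, active first, then standby.
# The layout mirrors the three-server diagram.
SHARD_REPLICAS: Dict[int, List[str]] = {
    1: ["ws1", "ws3"],
    2: ["ws2", "ws1"],
    3: ["ws3", "ws2"],
}

# Made-up per-shard match counts for one query.
SHARD_COUNTS: Dict[int, int] = {1: 10, 2: 20, 3: 30}


def search_shard(server: str, shard: int) -> int:
    """Stand-in for a remote search call; every replica returns the same count."""
    return SHARD_COUNTS[shard]


def scatter_gather(
    replicas: Dict[int, List[str]],
    alive: Set[str],
    search: Callable[[str, int], int],
) -> Tuple[int, Dict[int, str]]:
    """Query one live replica per shard and sum the partial counts."""
    total = 0
    routing: Dict[int, str] = {}
    for shard, servers in replicas.items():
        # Prefer the active replica; fall back to the standby if it is down.
        server = next((s for s in servers if s in alive), None)
        if server is None:
            raise RuntimeError(f"no live replica for shard {shard}")
        routing[shard] = server
        total += search(server, shard)
    return total, routing


healthy_total, healthy_routing = scatter_gather(
    SHARD_REPLICAS, {"ws1", "ws2", "ws3"}, search_shard)
failover_total, failover_routing = scatter_gather(
    SHARD_REPLICAS, {"ws1", "ws2"}, search_shard)  # Web Server 3 is down
```

With Web Server 3 down, shard 3 is served by its standby on Web Server 2 and the gathered total is unchanged, which is the property the standby replicas are there to provide.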
DocValues: Use Case. Once segments are built, users want to forecast a target revenue projection for the campaigns they want to run. Campaigns can be run on various revenue models. This involves adding per-member propensity scores and dollar amounts.
DocValues: Why not use Stored Fields? Stored fields have one indirection per document, resulting in two disk seeks per document. The performance cost quickly adds up when fetching millions of documents. [Figure: document ID -> .fdx (fetch file pointer to field data) -> .fdt (scan by ID until the field is found)]
DocValues: Why not use Field Cache? It is memory resident. It works fine when there is enough memory, but keeping millions of un-inverted values in memory is impossible. There is an additional cost to parse values (from String and to String).
DocValues: Dense column-based storage (1 value per document and 1 column per field and segment). Accepts primitives. No conversion from/to String needed. Loads 80x-100x faster than building a FieldCache. All the work is done during indexing. DocValues fields can be indexed and stored too.
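The column-based idea can be shown in miniature: one dense, typed array per field, indexed directly by document ID, so that summing a value over a segment needs no string parsing and no per-document indirection. The `ColumnStore` class below is a hypothetical stand-in, not Lucene's DocValues API, and the revenue formula (propensity times dollar amount, summed over the segment) is an assumption; the slides only say that propensity scores and dollar amounts are stored per member.

```python
from array import array
from typing import Dict, Iterable, Sequence


class ColumnStore:
    """One dense typed array per field; position i holds document i's value."""

    def __init__(self, num_docs: int) -> None:
        self.num_docs = num_docs
        self.columns: Dict[str, array] = {}

    def add_column(self, field: str, typecode: str, values: Sequence) -> None:
        """Store primitives directly ('d' = double, 'q' = signed 64-bit int)."""
        if len(values) != self.num_docs:
            raise ValueError("a dense column needs exactly one value per document")
        self.columns[field] = array(typecode, values)

    def projected_revenue(self, doc_ids: Iterable[int]) -> float:
        """Assumed model: sum of propensity * dollar amount over the segment."""
        propensity = self.columns["propensity"]
        dollars = self.columns["dollars"]
        return sum(propensity[d] * dollars[d] for d in doc_ids)


store = ColumnStore(num_docs=4)
store.add_column("propensity", "d", [0.5, 0.25, 1.0, 0.0])
store.add_column("dollars", "q", [100, 200, 50, 400])

# Document IDs returned by a segment query (made up for the example).
segment_doc_ids = [0, 2]
projection = store.projected_revenue(segment_doc_ids)  # 0.5*100 + 1.0*50
```

Because all values for a field sit next to each other, a scan over millions of matching documents reads memory sequentially instead of seeking once per document as stored fields do.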
Lessons Learnt. Indexing: reuse index writers, field and document instances; create many partitions and merge them in a different process; rebuild (bootstrap) the entire index if possible; use partial updates with caution; analyze the index. Serving: reuse a single instance of IndexSearcher; limit usage of stored fields and term vectors; plan for load balancing and failover; cache term frequencies; use different machines for serving and indexing.
Why not use an existing solution? First solution: doesn't allow dynamic schema; difficult to bootstrap indexes built in Hadoop; indexing elevates query latency. Second solution: doesn't allow dynamic schema; difficult to bootstrap indexes built in Hadoop; larger memory overhead; comparatively slow.
Questions? More info: data.linkedin.com