White Paper

Crowd-Sourcing Reflected Intelligence Using Search and Big Data

How LucidWorks and MapR can reflect crowd-sourced intelligence by leveraging Lucene/Solr

A white paper by Grant Ingersoll, Chief Scientist for LucidWorks, and Ted Dunning, Chief Application Architect for MapR
LucidWorks & MapR: Crowd-Sourced Intelligence

Abstract

This white paper explores how search has evolved in recent years beyond keyword search into a more broadly applicable information discovery tool by using principles of reflected intelligence. The paper then demonstrates how several organizations combine big data, search, and reflected intelligence to improve search results and decision-making. It concludes with a discussion of how LucidWorks and MapR work together to make this possible and how organizations can get started using reflected intelligence in their search applications.

The Evolution of Search

Search has become a mainstream and integral part of our daily lives. It is helpful to remember, however, that it wasn't always this way. In the early days of the Internet, tools like Archie, Veronica, and Jughead emerged to search for particular file names stored on FTP servers and Gopher listings. Once the World Wide Web was established with the release of the first browser and server code from CERN in 1992, search engines like WebCrawler, Lycos, Magellan, Excite, Infoseek, Inktomi, Northern Light, and AltaVista emerged to help unlock the information stored across what was, at the time, thousands of Web servers with perhaps hundreds of thousands of pages of information. As the Internet continued to grow exponentially, Google's innovative PageRank algorithm allowed the company to break away from the pack and eventually become synonymous with search on the Web. While this was all developing on the public Web, companies began to realize that they also held vast stores of information, both structured (enterprise applications, databases, and spreadsheets) and semi-structured (emails, documents, presentations, multimedia, etc.), and that search technology could provide an effective means to uncover insights and correlations about things like customers, products, and markets.
Enter Lucene/Solr and LucidWorks

Lucene was accepted into the Apache Software Foundation in 2001 and became its own top-level project in 2005. By 2007, companies such as Netflix, eHarmony, and Cisco, among others, had begun adopting Apache Lucene as an open source alternative to proprietary search engines for both internal and customer-facing applications. Apache Solr began as an internal project at CNET to provide a better search experience for their web visitors. Leveraging Lucene's search algorithms as its internals, Solr added a search server and significant additional capabilities. Solr was donated to the Apache Software Foundation in 2006 and over the last several years has become tightly coupled with Lucene. Lucene/Solr has become the de facto standard in open source search, in use at tens of thousands of companies and backed by a thriving developer community that numbers in the thousands.

Along Comes Hadoop and MapR

In 2006, the Lucene project spawned a sub-project by the name of Hadoop. By early 2008, it had become a top-level Apache project, and it is now a de facto standard for big-data analysis. What began as a way to make the Nutch web crawler scale to handle larger and larger crawling jobs has since morphed into a general-purpose, distributed file system and computation framework used in a wide variety of large-scale applications such as log processing, data warehousing, and much, much more.

Improving the Search Experience: The Next Frontier

Products such as Apache Lucene/Solr and LucidWorks Search have commoditized enterprise search, making it extraordinarily easy to deploy and to obtain highly relevant search results. As a result of LucidWorks' substantial investment, it no longer requires a search engine expert or developer to stand up an enterprise search server and tune it for optimal performance. Search professionals no longer need to focus on the basics of search, like how to index and search content.
Instead, implementers can focus on the real challenge of improving the user's application experience by exploiting the intersection of content, content relationships, user interactions, and access. In most cases, exploiting this information requires a big data solution, given the large volumes of data and user interactions seen in many applications. Even if there isn't a large amount of content, distributed computation is often useful for speeding up computationally intensive tasks like natural language processing.
Search Abuse: An Evolutionary Step Forward

As search has evolved beyond simple text retrieval, it has emerged as a building block for addressing tougher challenges like fuzziness, relevance ranking, and probabilities, all across data stores that include structured and semi-structured information. Search abuse is the notion that software intended for one thing (text retrieval) is re-purposed as a building block for another (non-textual analytics). Specifically, all kinds of data, including structured data or records of user behavior, can be analyzed by a search engine as a component of a larger system. These kinds of solutions are possible because the underlying algorithms and data structures that power a search engine can effectively be seen as a sparse matrix multiplication, and it just so happens that sparse matrix multiplication is often exactly what is needed to power many of these next-generation, data-driven applications. It turns out that there are many places to use non-textual information in search applications: to perform NoSQL types of data retrieval, to do large-scale machine learning and recommendations, and much, much more.

The Intersection of Search and Big Data

Things begin to get really interesting when a search engine is used to analyze the behavior of users who are themselves using a search engine. Search experts compare keyword search to dehydrated food: the basic nutrition is there, but it is not easily accessible until water is added. In this case, the water is behavioral information. Armed with other data sources, such as clicks, mouse tracks, ratings, and reviews, keyword search can be augmented and can lead to information discovery and ultimately to better decision-making. This is where tools like MapR's distribution for Hadoop enter into the search equation.
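The earlier claim that a search engine is effectively a sparse matrix multiplication can be made concrete in a few lines. The sketch below is illustrative only — the corpus, terms, and weights are invented; a real engine like Lucene stores the same structure as an inverted index on disk and adds term weighting and normalization.

```python
from collections import defaultdict

# Hypothetical toy corpus: doc id -> term counts. In a real engine these
# postings would live in Lucene's inverted index, not a Python dict.
docs = {
    "d1": {"brake": 2, "failure": 1},
    "d2": {"brake": 1, "pad": 1},
    "d3": {"engine": 1, "failure": 1},
}

# Build the inverted index (term -> {doc: weight}): a sparse matrix
# stored row by row, with terms as rows and documents as columns.
index = defaultdict(dict)
for doc_id, terms in docs.items():
    for term, count in terms.items():
        index[term][doc_id] = count

def score(query):
    """Score docs via a sparse matrix-vector product: for each query term,
    walk its postings and accumulate query weight * doc weight per doc."""
    scores = defaultdict(float)
    for term, q_weight in query.items():
        for doc_id, d_weight in index.get(term, {}).items():
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(score({"brake": 1.0, "failure": 1.0}))  # d1 matches both terms, ranks first
```

The key observation is that nothing in this loop cares whether the "terms" are words: they could just as easily be user IDs, product SKUs, or behavioral indicators, which is exactly what makes search abuse work.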
By including all of this meta-information about user behavior, search engines can find interesting patterns and correlations that feed directly back into search results. The end result is that the search system can reflect the behavior of subject matter experts back at other users who lack some of that training or experience. The system appears to users to act intelligently because it is reflecting intelligent action back at them. Big data plays a role in large-scale analysis as well, by producing clusters, identifying trends and topics, and finding statistically interesting phrases, similar documents, and many other things that require an aggregate view of the data. These large-scale discovery components can encourage system users to experiment with the data and can lead to a virtuous cycle: more people do search and discovery, and their behavior contributes to improving the search results and the insights that can be derived from them. Similar to the Agile method of software engineering, where organizations are always in development cycles, search can always be refining results based on user behavior. And with potentially millions of users hitting a system, some subset will behave in clever ways that can be reflected back to general users and make them more productive. The result is a system that appears to be intelligent, although it is simply reflecting back the intelligence of others.

Customer Examples and Use Cases

Reflected intelligence has utility in a wide variety of situations, as shown by examples from industries such as telecom, advertising, banking and insurance, education, government, and entertainment. Most of these use cases are not typical search applications: there are no users entering search terms into text boxes. Instead, these cases largely use search components as pivotal elements for adding value to content that already exists.
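Before turning to the examples, the reflection loop described above can be reduced to a minimal sketch: blend the engine's text relevance score with a log-damped click signal, so that documents experts actually chose float upward for everyone. All names and numbers here are hypothetical; in practice the click aggregation would be a batch job over behavior logs stored in Hadoop.

```python
import math

# Hypothetical aggregated behavior log: (query term, doc) -> click count.
# In production this table would be produced by a MapReduce job over raw logs.
clicks = {("brake", "d2"): 40, ("brake", "d1"): 5}

def boosted_score(term, doc, text_score, weight=0.5):
    """Blend a text relevance score with a log-damped click signal.
    log1p damping keeps a few power users from dominating the ranking."""
    click_boost = math.log1p(clicks.get((term, doc), 0))
    return text_score + weight * click_boost

# d2 has a lower text score but far more clicks, so it overtakes d1.
print(boosted_score("brake", "d1", 2.0))
print(boosted_score("brake", "d2", 1.5))
```

The blend weight is the tuning knob: too high and the system only reflects past behavior, too low and expert behavior never surfaces.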
Social Media in Telecom

Social media has evolved into a key component of marketing for many types of companies and organizations. The first use of search in social media has been to find mentions of a company across sources such as Twitter or Instagram. The true power of search is revealed in cases where a company can make operational decisions based on insights derived from search. One example is a major telecom provider that mines social media and correlates it with cell tower data to predict additional capacity demands for sporting events, music festivals, emergencies, etc.
Social Media Analysis for Advertising

Typical television advertising is done using a scattershot approach, where targeting is based on demographic data that paints a broad picture of an audience, e.g., affluent women within a given age range. As a result, ad placement pricing is based on reaching a portion of this very broad segment. It is estimated that up to a 5x multiple could be derived if the ads were backed up by good analysis of who was actually watching, when they were watching, and what they thought about the advertising and brand. By combining insights from social media, advertisers can get as much as 80% of the total value of the ad from this analysis, as compared to the ad itself.

Insurance Claims Processing and Analysis

Insurance companies always want a better understanding of the claims they are processing, whether to detect fraud or to find new trends and patterns that emerge from the pool of claims they see. Typical auto insurance claims include both tabular, attributed data, such as make, model, year, and price, and semi-structured data such as police reports, eyewitness reports, and victim reports. In the traditional data warehouse approach, analysts could ask questions about the attributed data but had no means to combine, rank, or facet on the complete picture. In this example, a large insurance company loaded both the structured and semi-structured data into their search application and then enriched it with behavioral data. Specifically, they looked at what the analysts were working on and performed low-level text analysis to identify trends and patterns. It turned out that they could spot trends such as drivers of a particular make and model reporting that their brakes failed just before a crash. This data could be fed back to the NTSB and to manufacturers, as well as to their own claims adjusters.
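The combine-and-facet step in this insurance example can be sketched as follows. The claims and field names are invented for illustration; in Solr the same operation is a text query with facet counting over the structured fields of each document.

```python
# Hypothetical claims mixing structured fields and free text, as an insurer
# might index them side by side in a single search document.
claims = [
    {"make": "Acme", "model": "Roadster", "report": "brakes failed before impact"},
    {"make": "Acme", "model": "Roadster", "report": "driver reported brake failure"},
    {"make": "Zenith", "model": "Sedan", "report": "rear-ended at stop light"},
]

def brake_facets(text_filter):
    """Filter claims by a free-text match on the narrative, then count facet
    values over the structured make/model fields -- the combine/rank/facet
    step a plain data warehouse could not do over free-text reports."""
    counts = {}
    for claim in claims:
        if text_filter in claim["report"].lower():
            key = (claim["make"], claim["model"])
            counts[key] = counts.get(key, 0) + 1
    return counts

print(brake_facets("brake"))  # one make/model stands out in brake narratives
```

Even on three toy records, the pattern the insurer found is visible: brake-related narratives concentrate in a single make/model.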
Virginia Tech: Helping the World in Crisis

Virginia Tech's Crisis, Tragedy, and Recovery Network (CTRN) serves as a resource to victims and their relatives as well as to first responders and policy makers. Any time there is a large national or global crisis, natural or man-made, the CTRN harvests content from the web, social media, news outlets, etc., and makes it immediately searchable as well as archived for future access. Over time, they employ large-scale natural language processing to identify trends, topics, themes, and relationships, both inside an event and across multiple events, to help policy makers and first responders develop systems and processes that improve response.

Bright Planet: Catching the Bad Guys

Bright Planet is in the business of harvesting intelligence from the web, beyond the reach of traditional search engines, for use by governments, businesses, and organizations. Bright Planet's client in this case is a large pharmaceutical manufacturer looking for evidence of the sale of counterfeit drugs. While search can provide some answers, more analysis is needed, since counterfeiters often carefully disguise their wares. Bright Planet looks for certain types of language and other indicators, which it feeds into its search algorithms along with enrichment data about how analysts are performing their analysis and what questions they are asking of the data. This surfaces new patterns and continuously refines and improves their analysis.

Veoh: Cross Recommendations

Veoh is a video content network that allows subscribers to watch, follow, share, and comment on aggregated video content from around the web. Their innovative recommendation engine leverages user behavior (videos searched, watched, and recommended; items clicked; words typed; mouse tracks; etc.) to influence recommendations and search results.
They use behavior across the entire subscriber population to influence an individual's search results, coalescing all of these various signals into a single query system with what appears to the user as magical results.
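Reduced to its core, the kind of cross-recommendation Veoh describes is co-occurrence counting: tally which videos are watched by the same subscribers, and serve the strongest co-occurrences back as recommendations. The histories below are invented; at Veoh's scale the counting would run as a distributed job, and the resulting pairs can be indexed in the search engine itself, so a recommendation is just another query.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical watch histories: one set of video ids per subscriber.
histories = [
    {"v1", "v2", "v3"},
    {"v1", "v2"},
    {"v2", "v3"},
    {"v1", "v4"},
]

# Count how often each ordered pair of videos is watched by the same person.
cooccur = defaultdict(int)
for history in histories:
    for a, b in combinations(sorted(history), 2):
        cooccur[(a, b)] += 1
        cooccur[(b, a)] += 1

def recommend(video, top_n=2):
    """Return the videos most often co-watched with this one."""
    scored = [(other, n) for (v, other), n in cooccur.items() if v == video]
    return [other for other, n in sorted(scored, key=lambda kv: -kv[1])[:top_n]]

print(recommend("v1"))
```

In a full system, raw counts would be replaced by a statistical test (such as log-likelihood ratio) to separate meaningful co-occurrence from popularity, but the shape of the computation is the same.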
Getting Started with Reflected Intelligence

There are several critical components needed to get started building applications that leverage reflected intelligence:

- Fast, efficient, scalable search. Lucene/Solr powers some of the world's largest websites and search applications with sub-second response against billions of records, so it makes a good choice for this fundamental component.
- Bulk and near-real-time indexing.
- A distributed computing platform for performance and scale.
- Storage capacity to store and work with raw data and to transform it to address the kinds of questions that will be asked.
- NLP and machine learning tools that scale, to address semi-structured and unstructured data.

The natural language processing and machine learning tools are what power discovery and analysis. They provide the ability to crunch through all of the feedback and user behavior data to understand what people are clicking. To make this work at scale, the feedback must flow seamlessly inside the system, with appropriate workflows in place to eliminate the need for administrators to chase down log files from disparate systems.

Reference Architecture for Reflected Intelligence

This reference architecture handles a wide variety of data types, both textual and behavioral. It can also accommodate an array of enrichment systems that elaborate and annotate documents for useful actions across a broad spectrum of business purposes. The enrichment systems can be batch-oriented and large-scale offline, or near-real-time. Discovery and enrichment can be done as a rough cut at the time of content acquisition, and content can be re-clustered at a later date when more is known. The heart of this architecture is the document store, represented by the grey cylinder in the middle of the diagram. Inside this store are multiple shards that make up the document store and retrieval index.
It contains text and semi-structured information, as well as structured information processed by ETL systems.
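The enrichment step this architecture describes can be sketched as a trivial classifier pass that tags newly added documents before they are (re)indexed. The rules and documents here are invented for illustration; real enrichers would be statistical classifiers or recommenders running as batch or near-real-time jobs over the document store.

```python
# Hypothetical keyword classifier: each rule maps a tag to trigger words.
# A production enricher would be a trained model, not a keyword table.
RULES = {
    "brake_issue": {"brake", "brakes"},
    "weather": {"ice", "snow", "rain"},
}

def enrich(doc):
    """Return a copy of the document with classifier tags attached, so
    later queries can filter and facet on the tags."""
    words = set(doc["text"].lower().split())
    tags = [tag for tag, keywords in RULES.items() if words & keywords]
    return {**doc, "tags": tags}

batch = [
    {"id": "c1", "text": "brakes locked on ice"},
    {"id": "c2", "text": "parking lot scrape"},
]
enriched = [enrich(d) for d in batch]
print([d["tags"] for d in enriched])
```

Because the tags become ordinary indexed fields, the same query machinery that serves users also serves the discovery and analytics processes described below.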
Discovery and enrichment processes run against recently added documents and look for patterns and enrichment opportunities that can improve search results. Enrichment can include classifiers and recommenders that create special tags and indicators on documents to improve correlations. Analytic services are accessed via general APIs that can query the system; queries may be explicit, or implicit where they are derived from behavior or formed from other data sources. Query processes don't necessarily have to return results to a user. Instead, they may be used to structure a website or to notify an analyst when particular conditions are met.

MapR Extends Hadoop for Reflected Intelligence

MapR provides a technology-leading, complete distribution for Hadoop with enhancements that make Hadoop easy, dependable, and fast. The MapR distribution includes the various Apache projects from the Hadoop ecosystem, such as Hive, Pig, HBase, and Mahout, over a platform that provides enterprise-grade features such as direct-access NFS, snapshots, mirroring, and instant node recovery.

Easy. MapR innovation allows users to access the Hadoop cluster through industry-standard APIs. Some of the standards built in and supported by MapR include full POSIX compliance, the Network File System (NFS), ODBC, Linux PAM, and REST. Beyond the standards, MapR also provides multi-tenancy, data placement control, and hardware-level monitoring of the cluster.

Dependable. MapR provides some of the best features for running mission-critical applications. These include self-healing of the critical services that maintain nodes and jobs, snapshots that allow point-in-time recovery of data, mirroring that allows inter-cluster replication over a WAN, and rolling upgrades that prevent service disruptions.

Fast. MapR is twice as fast as any other distribution.
It leverages an optimized shuffle algorithm, direct access to disk, built-in compression, and code written in C++ to provide superior performance over ordinary Hadoop. MapR is particularly well suited for reflected intelligence applications. It provides an integrated data platform that can store file-like objects, accessible through HDFS or NFS, and table objects that expose the HBase API. MapR supports real-time ingestion and processing for objects that store user behavior and change in real time. MapR's snapshot and mirroring capabilities are critical for reflected intelligence applications, as they support the evolution of large data objects over time. With these tools, new data can be layered on old data in what-if scenarios to assess the impact on an application. As search experts will attest, tuning a result set in one area can have unanticipated consequences in other areas, and this sort of impact analysis is crucial to good search hygiene. These snapshots support the always-testing model of enrichment, where the search application continues to improve simply through the act of more people using it over time. In addition, snapshots allow search professionals to play back what might have happened over a particular period of time and recreate situations for further troubleshooting. These capabilities go beyond ordinary Hadoop and make reflected intelligence applications possible.

LucidWorks Extends Lucene/Solr for Reflected Intelligence

LucidWorks is the leading provider of packaging, support, training, and knowledge about Apache Lucene/Solr. LucidWorks employs about a third of the committers to the open source project and was founded by a group of those committers to promote the adoption of Lucene/Solr. The company continues to contribute a considerable body of work back to the open source project each year. In the past year, the LucidWorks team has worked to ensure Lucene/Solr can scale to handle Hadoop workloads.
LucidWorks offers LucidWorks Search, which adds to Lucene/Solr a user interface for management and operations, along with a connector framework for integrating with tools like MapR and common enterprise repositories such as SharePoint and file systems, and it adds integration with organizations' security access control lists. LucidWorks Big Data offers big data as a service. It is constructed very similarly to the reference architecture described earlier in this document. It incorporates LucidWorks Search and adds Hadoop and machine learning, along with pre-built workflows that eliminate the pain of moving data around to be processed.

The LucidWorks Big Data Marketecture

The Big Data Operating System at the heart of this diagram is the reference architecture discussed earlier, where LucidWorks Search is combined with Hadoop, HBase, etc., and ensures that the data is in the right place at the right time. On top of this substrate, search, discovery, and analytics applications are built that leverage machine learning tools, natural language processing, and the tools needed to scale, with pre-defined workflows. This is all accessible through a set of REST APIs, so a non-expert can interact with the services using common web standards like REST and JSON. The right side of the diagram is the system management layer, with the glue, like ZooKeeper, and provisioning tools. To get content into the system, LucidWorks provides a variety of connectors to a range of enterprise data sources, databases, and S3 buckets, and the system also supports pushed data.
The LucidWorks/MapR Advantage

The goal of the partnership between LucidWorks and MapR is to enable a rapid path to the next generation of search by using reflected intelligence, along with other methods, to unlock correlations and insights from large data sets and ultimately drive better decisions for individuals and organizations. By using LucidWorks and MapR, organizations can quickly build reflected intelligence search applications where:

- Data can be ingested into MapR by a variety of methods, through Hadoop ecosystem components, or by storing data directly and transparently via NFS (for legacy components)
- Search indices can be stored in MapR and fed in a MapReduce setting into tools like Pig and Mahout, or can be deployed using mirrors or NFS
- MapR snapshots make backups very simple
- Snapshots also allow scenarios to be replayed and support experiment management: correlate scoring factors, config files, log analysis, etc., to see what users saw at the time
- LucidWorks connects transparently with MapR
- No unnatural acts are required: logs live in NFS or file systems that MapR presents, and MapReduce jobs can run over them without concern for where they reside

Learn More and Get Started Today

To learn more about using crowd-sourced reflected intelligence for search and big data, please visit the LucidWorks and MapR websites. A webinar with Grant Ingersoll, Chief Scientist for LucidWorks, and Ted Dunning, Chief Application Architect for MapR, can be found on either site. For a direct response, please contact either company.

MapR delivers on the promise of Hadoop with a proven, enterprise-grade platform that supports a broad set of mission-critical and real-time production uses. MapR brings unprecedented dependability, ease of use, and world-record speed to Hadoop, NoSQL, database, and streaming applications in one unified big data platform.
MapR is used across financial services, retail, media, healthcare, manufacturing, telecommunications, and government organizations, as well as by leading Fortune 100 and Web 2.0 companies. Amazon, Cisco, EMC, and Google are part of MapR's broad partner ecosystem. Investors include Lightspeed Venture Partners, Mayfield Fund, NEA, and Redpoint Ventures. MapR Technologies. All rights reserved. Apache Hadoop and Hadoop are trademarks of the Apache Software Foundation and are not affiliated with MapR Technologies.