Brian Lao - bjlao
Karthik Jagadeesh - kjag

Legal Informatics Final Paper Submission
Creating a Legal-Focused Search Engine

I. BACKGROUND

There is a large need for improved access to legal help. For example, 9 out of 10 employees had a legal concern in the last year [1]. Additionally, under-served U.S. citizens with legal issues comprise a latent legal market valued at $45 billion [1]. The latent legal market exists for a variety of reasons. First, the cost of a traditional lawyer often discourages consumers from seeking any help at all. Second, even when consumers have decided to use legal counsel, finding a lawyer with the relevant expertise can be difficult. However, with the prevalence of the Internet and the digitization of information, individuals have become increasingly inclined to look online for legal answers. For example, 3 out of 4 consumers seeking an attorney now use online resources to help them find legal counsel [2], and a recent ABA study showed that, of the online models people use to help them find a lawyer, Q&A websites were the most popular [3].

II. PROBLEM AND SOLUTION

Although a wealth of information is freely available on the Internet, much of it remains disorganized, and it can be difficult to find the exact information one is interested in. Google does an excellent job of helping consumers find relevant links for a query, but it is a general search engine whose results are not always amenable to answering a legal question. For example, if a legal question about enforcing non-compete agreements is typed into Google, the top results are general definitions of the legal term along with recent, related news. This lack of specificity can make it difficult for an average consumer to get an answer to their legal question.

On the other end of the spectrum are legal case search engines like CaseText. CaseText, for example, returns a set of relevant cases annotated by legal professionals and professors. However, CaseText's target demographic comprises lawyers, academics, and law students. An average consumer with a legal question is not necessarily looking for technical cases and statutes. There is also a set of services, similar to Rocket Lawyer, where questions entered online are routed on the back-end to relevant lawyers who answer them directly. While these services provide answers from legal professionals who are knowledgeable in the area, they fail to address the growing demand for instantaneous results that web consumers have become accustomed to from services like Google.

Since a large proportion of answers to legal questions are already publicly available on the Internet through legal Q&A websites such as Yahoo Answers and Avvo, the goal is to create a platform that aggregates these responses and makes them easy to find for the average user. We therefore focused on creating a legal-specific search engine that allows the user to enter a fully-formed legal question and situation. After applying natural language processing and search ranking algorithms, legal-specific search results are returned to the user, with a focus on providing the most relevant sets of questions and answers from legal Q&A websites.

III. METHODS AND PRODUCT

Overview - We have built an initial version of a search engine that finds relevant, previously answered questions and answers from the web and presents the results to the user in a clear, easy-to-understand format. The system consists of three major components, each built using well-known open-source software. Here is a brief description of the components:

(1) Crawling and Indexing - For web crawling, we used Apache Nutch, an open-source Java-based system that crawls a seed set of input HTML URLs. By crawling websites like Avvo, we obtain the Q&A content that can then be presented to the user. After crawling the content, we need to organize the Q&A results by labeling them with tags so users can efficiently find relevant legal answers. Upon crawling the information from these Q&A websites, we index the questions, answers, and their relevant tags into a database. For indexing and storage, we use another open-source Java-based system, Apache SOLR. We integrate the Apache Nutch framework with Apache SOLR so that the crawler feeds the crawled information directly into SOLR for indexing and storage.

(2) Query Processing - After a user enters a legal question query, we need to compare that question against our index of information to find the most relevant search results. To accomplish this, we again leverage Apache SOLR, which uses a distance metric to determine the similarity between two documents. This allows us to quantitatively determine which Q&A documents should be shown to the user.

(3) Front-end and Back-end Interaction - The third and final step is to display results on the front-end for the user to read. We used a JavaScript library called AJAX SOLR, which allows us to access the data indexed by SOLR on the back-end and send it to the front-end to be displayed. We use HTML and CSS to customize the actual layout and user interface of the page.

The above is a brief overview of the methods we used; each component is explained in detail below.

Crawling and Indexing - Building a legal search engine requires a large index of legal information, a subset of which is eventually shown to the user as the search results for their query. To obtain this index, we used a web crawler that browses specified web domains and collects their content. We leveraged the open-source Java-based web crawler Apache Nutch [4]. With Nutch, one specifies seed URLs from which the crawler starts its descent: the crawler grabs the content of the initial seed URLs and then moves on to the links present on those seed webpages. By modifying parameters such as the crawling depth and the number of links to crawl, one can tailor the crawler to obtain the desired legal information. Given that we are building a legal-focused search engine, we limited the crawled web domains to legal-related content, such as the Avvo and Rocket Lawyer websites.

By default, Apache Nutch does not perform any advanced parsing of a webpage's content; all text is treated the same. For legal Q&A websites, however, we wanted the crawler to parse the content in a more meaningful way. First, a document for us consists of a single legal question together with the lawyers' answers, and we wanted the legal question to be indexed as a separate field from the answers. Second, we wanted to index certain tags for each legal Q&A question, such as the practice area the question pertained to and the state in which it was asked. All of this information is present in the given webpage's HTML structure, but advanced parsing was required to identify and retrieve it. Thus, we combined Nutch's crawling abilities with JSoup, an open-source Java-based HTML parser, and used JSoup's CSS selectors to decompose a given legal question and its answers into their various parts and tags (minimal sketches of this parsing and indexing flow are given at the end of this subsection). In addition to legal questions and answers, we also aim to crawl other legal-related content, including websites with relevant legal information, law firm websites, cases and statutes, and related legal news and blogs. JSoup will allow us to parse this non-Q&A legal content more meaningfully as well.

An impressive feature of Nutch is its ability to run on Hadoop, an open-source framework that leverages clusters of processors for distributed storage and analysis of large data sets. Although we have so far used a single computer to crawl and store information, we would eventually need to migrate to a cluster of processors for speed if our legal search engine gained enough traction. The crawling and indexing process is illustrated in Figure 1.

Figure 1. Diagram illustrating the various steps in crawling the Internet for legal-related content, parsing that content, and indexing it on a storage server.
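As an illustration of the parsing step, the following is a minimal JSoup sketch. The CSS selectors and field names are hypothetical placeholders, not the actual selectors used for Avvo or Rocket Lawyer pages:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class QuestionPageParser {

    /** Simple container for one crawled Q&A document. */
    public static class LegalQA {
        public String question;
        public List<String> answers = new ArrayList<>();
        public String practiceArea;
        public String state;
    }

    /**
     * Decomposes the raw HTML of a legal Q&A page into its parts using CSS
     * selectors. The selector strings below are placeholders; the real
     * selectors depend on the markup of the crawled site.
     */
    public static LegalQA parse(String html) {
        Document doc = Jsoup.parse(html);
        LegalQA qa = new LegalQA();

        // Question text (hypothetical selector).
        Element question = doc.select("div.question-title").first();
        qa.question = question != null ? question.text() : "";

        // Each lawyer answer becomes its own entry (hypothetical selector).
        Elements answers = doc.select("div.answer-body");
        for (Element a : answers) {
            qa.answers.add(a.text());
        }

        // Tags such as practice area and state (hypothetical selectors).
        Element area = doc.select("span.practice-area").first();
        qa.practiceArea = area != null ? area.text() : "";
        Element state = doc.select("span.question-state").first();
        qa.state = state != null ? state.text() : "";

        return qa;
    }
}
```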
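Once a page has been decomposed, the resulting fields and tags can be pushed into SOLR. Below is a minimal SolrJ sketch of that indexing step; the core URL, document id, and field names are illustrative assumptions (in our system, Nutch feeds the crawled information into SOLR directly rather than through hand-written indexing code):

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class LegalQAIndexer {

    public static void main(String[] args) throws Exception {
        // Core name and field names are illustrative, not the ones used in our deployment.
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/legalqa").build();

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "avvo-12345");                               // unique document id
        doc.addField("question", "Can my employer enforce a non-compete agreement?");
        doc.addField("answers", "An answer from a lawyer ...");         // multi-valued field
        doc.addField("practice_area", "Employment");
        doc.addField("state", "California");

        solr.add(doc);     // send the document to the index
        solr.commit();     // make it searchable
        solr.close();
    }
}
```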

Query Processing and Search Results - After crawling the web and creating a large dataset of legal-related material, we needed methods to efficiently return the most relevant and personalized results to an individual. This is an important component of the search engine: as the amount of legal data we collect from the web grows, it takes more time to process and return the relevant documents. Apache SOLR plugs nicely into the Apache Nutch web crawler, so we can easily index the data we collect from the web.

The input query from the user first goes through a set of basic pre-processing steps so that we can better find similar documents stored in the SOLR index. These steps involve fixing spelling issues in the sentence, removing punctuation from the query, and finally removing stop words. They normalize the words in the query so that the query can easily be compared against the words in each of the documents stored in SOLR.

To compare how similar a query q is to a document d, several methods can be used; we started with the basic cosine similarity metric. The first step converts each document (after preprocessing) into a vector whose component i is the Term Frequency-Inverse Document Frequency (TF-IDF) score of word w_i within that document, where the set of all w_i consists of every word that appears at least once in any document. The TF-IDF score captures the frequency of each word within a document and gives a higher weight to words that are rare across all documents (Figure 2). With this vector representation, we can apply the cosine similarity metric to determine the similarity between any two documents in the set. The metric measures the angle between the vector representations of the two documents: intuitively, the larger the angle between two vectors, the more different the underlying documents are, and the smaller the angle, the more similar they are. (A sketch of the pre-processing step, the standard TF-IDF and cosine similarity formulas, and a sketch of issuing such a query against SOLR are given at the end of this subsection.)

The model we have described is very basic and only uses word frequency to capture the similarity between two documents. Despite its simplicity, the search results have generally turned out to be very relevant to the input query, which is a positive sign. We hope to incorporate more advanced natural language understanding techniques that can take advantage of the fully-formed nature of a given user's query.

Figure 2. Each document is represented as an n-dimensional vector where component i is the TF-IDF score for word i in the corpus.
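The following is a minimal sketch of the normalization steps described above (lower-casing, punctuation removal, stop-word removal); spelling correction is omitted here, and the stop-word list is only illustrative:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class QueryPreprocessor {

    // A tiny illustrative stop-word list; a real deployment would use a fuller one.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "the", "is", "are", "to", "of", "in", "my", "i", "can", "do"));

    /** Lower-cases the query, strips punctuation, and drops stop words. */
    public static List<String> preprocess(String query) {
        String cleaned = query.toLowerCase().replaceAll("[^a-z0-9\\s]", " ");
        return Arrays.stream(cleaned.split("\\s+"))
                .filter(token -> !token.isEmpty())
                .filter(token -> !STOP_WORDS.contains(token))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Prints: [employer, enforce, non, compete, agreement]
        System.out.println(preprocess("Can my employer enforce a non-compete agreement?"));
    }
}
```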
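For reference, the standard TF-IDF weighting and cosine similarity formulas that the description above corresponds to are reproduced below, where tf(w_i, d) is the count of word w_i in document d, N is the number of documents in the index, and df(w_i) is the number of documents containing w_i. (SOLR's internal scoring uses its own variant of this weighting, so these formulas are illustrative rather than the exact computation performed by our index.)

d_i = \mathrm{tf}(w_i, d) \cdot \log\!\left(\frac{N}{\mathrm{df}(w_i)}\right)

\mathrm{sim}(q, d) = \cos(\theta) = \frac{\sum_i q_i\, d_i}{\sqrt{\sum_i q_i^2}\; \sqrt{\sum_i d_i^2}}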
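Finally, a minimal SolrJ sketch of issuing a preprocessed query against the index is shown below; the core URL and the field names queried ("question", "answers") are illustrative assumptions:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class LegalQASearcher {

    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/legalqa").build();

        // Search the (hypothetical) "question" field with the preprocessed user query.
        SolrQuery query = new SolrQuery("question:(employer enforce non compete agreement)");
        query.setRows(10);                       // top 10 results
        query.setFields("question", "answers");  // fields to return

        QueryResponse response = solr.query(query);
        for (SolrDocument doc : response.getResults()) {
            System.out.println(doc.getFieldValue("question"));
        }
        solr.close();
    }
}
```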

Front-End & Back-End - The methods above describe the processes that occur on the back-end. These back-end processes need to be connected to a front-end interface that is shown to and used by an individual. To accomplish this, we used AJAX SOLR [5], a JavaScript framework for connecting a front-end interface to an Apache SOLR back-end. On the front-end, a user enters a legal question and, optionally, the state in which they reside as well as the relevant area of law. The AJAX SOLR framework employs a manager that takes the front-end query and sends it to the back-end SOLR server. SOLR then runs the search algorithms to find the most relevant results in the index. These results are sent back through the AJAX SOLR manager, which translates them into HTML that the front-end can display to the user. The process is illustrated in Figure 3.

Figure 3. Diagram of how AJAX SOLR is used as a bridge between the front-end user interface and the back-end processes.

We modified the AJAX SOLR framework to provide certain search engine features such as pagination of our legal search results and filters. Given that tags were added to the legal questions and answers in the crawling and indexing stage, we created a front-end interface that allows users to filter their search results by state or practice area. Thus, a user can select the California filter to see only the legal questions and answers from California (a sketch of how such filters and pagination map onto SOLR query parameters is shown below).
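As a back-end illustration of how such filters and pagination correspond to SOLR query parameters, the sketch below builds a filtered, paginated query with SolrJ. In our system this mapping is performed by the modified AJAX SOLR JavaScript front-end rather than by Java code, and the field names are assumptions:

```java
import org.apache.solr.client.solrj.SolrQuery;

public class FilteredSearchExample {

    /**
     * Builds a query that mirrors the front-end filters and pagination:
     * filter queries restrict results by tag, and start/rows select one page
     * of results. Field names ("state") are illustrative.
     */
    public static SolrQuery buildQuery(String userQuery, String state, int page, int pageSize) {
        SolrQuery query = new SolrQuery(userQuery);
        if (state != null) {
            query.addFilterQuery("state:\"" + state + "\"");  // e.g. only California questions
        }
        query.setStart(page * pageSize);  // offset of the requested results page
        query.setRows(pageSize);          // number of results per page
        return query;
    }
}
```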

IV. CONCLUSION

We believe that building a consumer-oriented legal search engine that helps individuals find answers to their legal questions will create real value. There are several legal Q&A web services that push questions to actual lawyers, but none of these services provide instantaneous results. We hope to leverage the legal information that is already available in order to provide a unique solution to a problem that people face. We have been able to build a basic prototype legal search engine using freely available open-source tools. The preliminary results are promising and provide some validation for the tool we are building. We plan to pursue this project further by building out and publishing online a fully-featured search engine that helps answer legal queries.

V. REFERENCES

[1] Granat, Richard. "eLawyering for Competitive Advantage." eLawyering Taskforce, ABA. Sep 20, 2011. http://www.kentlaw.edu/faculty/rstaudt/classes/justicetech_fall2011/ppts/granatchicagokentlawschool%20%282%29.ppt
[2] Bodine, Larry. "Most Consumers Go Online to Look for an Attorney." LexisNexis. May 14, 2012. http://www.lexisnexis.com/community/portal/blogs/bodinelx/archive/2012/05/14/mostconsumers-go-online-to-look-for-an-attorney.aspx
[3] Cassidy, Richard. "Perspectives on Finding Personal Legal Services." ABA. February 2011. http://www.americanbar.org/content/dam/aba/administrative/delivery_legal_services/20110228_aba_harris_survey_report.authcheckdam.pdf
[4] Nioche, Julien, et al. Apache Nutch. Apache Software Foundation. http://nutch.apache.org/
[5] McKinney, James. Ajax Solr: A JavaScript framework for creating user interfaces to Solr. April 11, 2013. https://github.com/evolvingweb/ajax-solr