# Auto-Grouping s For Faster E-Discovery

Save this PDF as:

Size: px
Start display at page:

## Transcription

3 Algorithm ComputeSignatures (Document d) % Initialize all n signatures to the maximum value For (i = 1; i n; i + +) signature i = MAX For each k length character sequence seq in doc d For (i = 1; i n; i + +) Let checksum i = f i (seq) If checksum i < signature i then signature i = checksum i % Keep only the j lowest significant bits For (i = 1; i n; i + +) signature i = lsb j (signature i ) Figure 3: Signature computation algorithm. 3.2 Our Approach For detecting almost identical content, we adopt a signature based technique. Signature based techniques are not only faster to compute but can also be efficiently stored in an index structure and can thus easily be used for detecting near duplicates with varying degrees of similarity thresholds at runtime. Signature based techniques generate a set of signatures for each document and use them to compute a similarity measure between a pair of documents. We specifically use the resemblance measure [4] to compute the similarity of a pair of documents which is defined as follows: Sim(d 1, d 2) = S1 S2 S 1 S 2 (1) where, d 1 and d 2 are two given documents and S 1 and S 2 are the set of all the substrings of length k that appear in d 1 and d 2 respectively. A high value of resemblance signifies a high similarity in the content of the given documents. Figure 3 describes a method ComputeSignatures that generates n signatures for each document. The method uses n different oneto-one functions referred to as f i in the Figure 3 that are kept the same for all documents. Each function is a 64 bit Rabin fingerprinting function. For a given document d, all the k length character sequences are considered. The value of k is empirially determined and in our experiments, we use a value of k = 20. Note, that if the document d contains p characters then there will be p k + 1 such sequences. For each f i all the p k + 1 values are computed and the smallest of them is kept as the i th signature. For efficiency reasons we keep only the j lowest significant bits for each of the smallest signatures which is computed by the function lsb j in the Figure 3. Thus the value of signatures range from 0 to 2 j 1. This method has some similarities to the method proposed in [10]. It cannot be proved that these functions are min-wise independent [4] but they are found to be effective in practice [10] [13]. Assuming a uniform probability distribution on the values, the probability of two documents having the same signature is 1 2 j, where each signature is a j bit value. This is the probability that two documents will have the same signature by chance. The probability that two documents will have the same m signatures by chance can be given by ( 1 2 j ) m. This value could be very high for low value of j and m. A high value of this probability will increase the occurrence of false positives. We investigate appropriate values for these parameters in Appendix A. 3.3 Groups of Near Duplicate s Each signature value can be used as a dimension and an can be represented as a feature vector in this space. Traditional clustering algorithms such as K-means or hierarchical agglomera- Procedure CreateSyntacticGroups(Index I) For each document d in Index I do If d is not covered Let S = {s 1, s 2,..., s n} be its signatures D = Query(I, atleast(s, k)) % make all the documents in D covered For each document d in D d is covered Figure 4: Creating groups of syntactically similar documents. tive clustering (HAC) [2] can then be used to create groups of near duplicate s. However, the traditional clustering algorithms are in-memory algorithms and cannot be easily used for huge datasets with millions of documents. We therefore develop a clustering method that works out of an index. For each n signatures are computed and indexed along with the content and other meta fields associated with the document. We then use the procedure CreateSyntacticGroups outlined in Figure 4 to create groups of near duplicate s. It uses a parameter referred to as k that denotes the minimum number of signatures that needs to be the same for a pair of documents to belong to the same group. The procedure finds groups of near duplicate s by first picking up an uncovered d from the index. It then searches for all the s that have at least k same signatures as the n signatures associated with the d. This set can easily be identified using a single query on the index using the posting list associated with each signature value. In the Figure 4, the function Query(I, atleast(s, k)) returns the set of s that have at least k signatures from the set S. Other kinds of constraints concerning the hash value of the associated attachments can also be included in the query. These kinds of queries are supported in several off-the-shelf indexing frameworks such as Lucene 1. The returned set forms a set of near duplicate s as all of them have at least k same signature values. The procedure continues with the next uncovered to find other groups of near duplicate s. Note that while reviewing a particular group at review time, an analyst can change the value of k to obtain more or less fine grained groups. Creating the group with a new value of k involves running a single query against the index and therefore can be done in an on-line fashion. 3.4 Groups of Automated s We observe that there are several types of automated messages in an enterprise collection of s. Automated messages are frequently used for various purposes such as automatic reminders, confirmations, out of office messages, and claims processing. These messages are typically generated based on a template which consists of some textual content along with a set of fields that are filled in during message generation. All the messages that are generated from a single template have almost identical content and are therefore near duplicates of each other. Figure 5 shows four example s from our large testing set that are generated automatically along with a possible template (in the rectangular box at the bottom) that could be used to generate these messages. For each , only the subject and content fields are shown. The subject and content of all the s are identical except at a few places that are highlighted in the figure. In our analysis, we found as many as 3, 121 s in the large 500K collection that are generated using a similar template and are near duplicates of the examples given in the Figure

9 [23] Y. Wu and D. Oard. Indexing s and threads for retrieval. Proceedings of SIGIR, pages , [24] Jen Yuan Yeh and Aaron Harnly. thread reassembly using similarity matching. Proceedings of CEAS, [25] J. Zawinski. Message Threading, APPENDIX Appendix A describes the experiments for the evaluation of the near duplicate detection technique used in the paper for grouping s. Appendix B presents a small user study that shows how auto-grouping of s can significantly reduce the time needed for document reviews. Finally in Appendix C, we describe the integration of many of these features into IBMs ediscovery Analyzer product offering. A. NEAR DUPLICATE DETECTION In this experiment, we measure the quality of the near duplicate s detected by our proposed method. Similar to thread detection, near duplicate document detection performance is quantified in terms of precision and recall. Since we did not have a list of all the near duplicate documents for a given document in the large set we used, it will have been very difficult to quantify the recall of our method. As a result, in this experiment we only focused on the precision of the returned set of near duplicates. For quantifying precision, we manually reviewed sets of near duplicate s returned in response to 500 queries. The queries were constructed using a randomly selected set of 15 s. We measured precision as a function of two variables: 1) the number of bits used to store the signatures and 2) the level of similarity threshold used for search. We experimented with a minimum of 6 bits to a maximum of 14 bits for storing signatures. In all cases, 10 signatures are generated for each . For the similarity threshold, we experimented with a maximum of 100% similarity to a minimum of 50% similarity. Note that the similarity threshold values are translated into a minimum number of signatures that needs to be shared by a document for it to be considered a near duplicate of the given document. Note that using a similarity threshold level of 100% is not the same as finding exact duplicates using a hashing function such as MD5. s may have exactly the same signatures (e.g., 100% similarity) even if their contents differ slightly whereas their MD5 values will be completely different. Figure 10 shows the precision for various configurations. We observe that the precision drops drastically when we use only 6 bits for each signature. We observe, in this particular case, a precision of 100% even with a low similarity threshold of 50% when we use more bits (10, 12 and 14 bits) for the signatures. It is, of course, not clear that 100% precision would be possible in all cases. Results would vary based on the type and actual content of s. These results support the observation made in Section 3 that lower values for bits per signature and lower levels for similarity threshold have a higher probability of false positives. Since it is important to use more bits to represent the signatures, we next examined the impact of the signature bit size on the size of the index. For this experiment, we generated ten signatures for each and varied the range of each signature from 6 bits to 14 bits with an interval of 2. In our index, we do not have a provision for storing integers and rather store them as strings. Figure 11 shows the increase in index size as a function of the number of bits used for each signature. Overall, the storage requirement for signatures is very modest. Even after storing ten signatures of 14 bits for every document, the index size increases by less than 1.9%. In our next experiment, we investigated how much near duplicate document detection can help in the review process. As pointed out Figure 10: Precision of results obtained for several s with different size signatures and similarity threshold. Figure 11: Increase in index size for storing signatures. earlier, since we cannot measure the recall of our method (due to unavailable reference assessments), we instead computed the number of near duplicate s that can potentially be determined using our method with a precision of 100%. We are able to do this since we manually reviewed the near duplicate sets returned by the system. We create sets of documents such that all the near duplicate s are grouped together. In this test, we used 14 bits per signature and generated near duplicate sets with varying levels of similarity threshold ranging from 100% to 50%. Since most of the content of the s in a set is the same, all s in the set can be reviewed together which can potentially save a lot of review time. Figure 12 shows the number of near duplicate sets obtained in the large collection with varying levels of the similarity threshold. The graph shows that only 33.4% of the s are unique at a similarity threshold of 50%. Therefore, more than 66% of the s are either exact or near duplicate messages. The results suggest that identifying exact or near duplicate s or messages and reviewing them together as a group can lead to significantly reduced review times. The large set contains many exact duplicates. In our next experiment, we quantified the number of exact duplicate documents so we can get a clearer picture of how many additional documents are detected by the near duplicate method. To detect exact duplicates, we used the MD5 hash algorithm on the contents. Figure 13 shows the number of s with a certain number of copies in the large set. Note that the vertical axis in the graph is in Log scale. This is a power distribution showing a large number of s with a small number of copies and a small number of s with very large numbers of copies. We found 116,728 s that had no copies. At the other extreme, we found one that has as many as 112 exact duplicates in the set. Overall we found 51.8% 1292

### Automated Concept Extraction to aid Legal ediscovery Review

Automated Concept Extraction to aid Legal ediscovery Review Prasad M Deshpande IBM Research India prasdesh@in.ibm.com Sachindra Joshi IBM Research India jsachind@in.ibm.com Thomas Hampp IBM Software Germany

### Email Spam Detection Using Customized SimHash Function

International Journal of Research Studies in Computer Science and Engineering (IJRSCSE) Volume 1, Issue 8, December 2014, PP 35-40 ISSN 2349-4840 (Print) & ISSN 2349-4859 (Online) www.arcjournals.org Email

### Building a Question Classifier for a TREC-Style Question Answering System

Building a Question Classifier for a TREC-Style Question Answering System Richard May & Ari Steinberg Topic: Question Classification We define Question Classification (QC) here to be the task that, given

### Clearwell Legal ediscovery Solution

SOLUTION BRIEF: CLEARWELL LEGAL ediscovery SOLUTION Solution Brief Clearwell Legal ediscovery Solution The Challenge: Months Delay in Ascertaining Case Facts and Determining Case Strategy, High Cost of

### 2015 Workshops for Professors

SAS Education Grow with us Offered by the SAS Global Academic Program Supporting teaching, learning and research in higher education 2015 Workshops for Professors 1 Workshops for Professors As the market

### Search and Information Retrieval

Search and Information Retrieval Search on the Web 1 is a daily activity for many people throughout the world Search and communication are most popular uses of the computer Applications involving search

### Winning with Equivio Software for near-duplicates and email threads

Winning with Equivio Software for near-duplicates and email threads Winning with Equivio Equivio groups near-duplicates and structures email threads. By grouping similar documents, Equivio users gain a

### WEB PAGE CATEGORISATION BASED ON NEURONS

WEB PAGE CATEGORISATION BASED ON NEURONS Shikha Batra Abstract: Contemporary web is comprised of trillions of pages and everyday tremendous amount of requests are made to put more web pages on the WWW.

### SPATIAL DATA CLASSIFICATION AND DATA MINING

, pp.-40-44. Available online at http://www. bioinfo. in/contents. php?id=42 SPATIAL DATA CLASSIFICATION AND DATA MINING RATHI J.B. * AND PATIL A.D. Department of Computer Science & Engineering, Jawaharlal

### The Enron Corpus: A New Dataset for Email Classification Research

The Enron Corpus: A New Dataset for Email Classification Research Bryan Klimt and Yiming Yang Language Technologies Institute Carnegie Mellon University Pittsburgh, PA 15213-8213, USA {bklimt,yiming}@cs.cmu.edu

### Web Document Clustering

Web Document Clustering Lab Project based on the MDL clustering suite http://www.cs.ccsu.edu/~markov/mdlclustering/ Zdravko Markov Computer Science Department Central Connecticut State University New Britain,

### Veritas ediscovery Platform

TM Veritas ediscovery Platform Overview The is the leading enterprise ediscovery solution that enables enterprises, governments, and law firms to manage legal, regulatory, and investigative matters using

### Search Engine Architecture I

Search Engine Architecture I Software Architecture The high level structure of a software system Software components The interfaces provided by those components The relationships between those components

### Mining Text Data: An Introduction

Bölüm 10. Metin ve WEB Madenciliği http://ceng.gazi.edu.tr/~ozdemir Mining Text Data: An Introduction Data Mining / Knowledge Discovery Structured Data Multimedia Free Text Hypertext HomeLoan ( Frank Rizzo

Symantec ediscovery Platform, powered by Clearwell Data Sheet: Archiving and ediscovery The brings transparency and control to the electronic discovery process. From collection to production, our workflow

### Three Methods for ediscovery Document Prioritization:

Three Methods for ediscovery Document Prioritization: Comparing and Contrasting Keyword Search with Concept Based and Support Vector Based "Technology Assisted Review-Predictive Coding" Platforms Tom Groom,

### IBM ediscovery Identification and Collection

IBM ediscovery Identification and Collection Turning unstructured data into relevant data for intelligent ediscovery Highlights Analyze data in-place with detailed data explorers to gain insight into data

### Auto-Classification for Document Archiving and Records Declaration

Auto-Classification for Document Archiving and Records Declaration Josemina Magdalen, Architect, IBM November 15, 2013 Agenda IBM / ECM/ Content Classification for Document Archiving and Records Management

### Search Result Optimization using Annotators

Search Result Optimization using Annotators Vishal A. Kamble 1, Amit B. Chougule 2 1 Department of Computer Science and Engineering, D Y Patil College of engineering, Kolhapur, Maharashtra, India 2 Professor,

### META DATA QUALITY CONTROL ARCHITECTURE IN DATA WAREHOUSING

META DATA QUALITY CONTROL ARCHITECTURE IN DATA WAREHOUSING Ramesh Babu Palepu 1, Dr K V Sambasiva Rao 2 Dept of IT, Amrita Sai Institute of Science & Technology 1 MVR College of Engineering 2 asistithod@gmail.com

### Litigation Support. Learn How to Talk the Talk. solutions. Document management

Document management solutions Litigation Support glossary of Terms Learn How to Talk the Talk Covering litigation support from A to Z. Designed to help you come up to speed quickly on key terms and concepts,

### A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING

A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING Sumit Goswami 1 and Mayank Singh Shishodia 2 1 Indian Institute of Technology-Kharagpur, Kharagpur, India sumit_13@yahoo.com 2 School of Computer

### Social Media Mining. Data Mining Essentials

Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

### SRC Technical Note. Syntactic Clustering of the Web

SRC Technical Note 1997-015 July 25, 1997 Syntactic Clustering of the Web Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, Geoffrey Zweig Systems Research Center 130 Lytton Avenue Palo Alto, CA 94301

### An Evolutionary-Based Method for Reconstructing Conversation Threads in Corpora

Mostafa Dehghani Masoud Asadpour Azadeh Shakery University of Tehran Tehran, Iran. An Evolutionary-Based Method for Reconstructing Conversation Threads in Email Corpora ASONAM 2012 Presented by: Mostafa

### Sample Electronic Discovery Request for Proposal

[COMPANY LOGO] Sample Electronic Discovery Request for Proposal Table of Contents OVERVIEW... 3 IMPORTANT CONSIDERATIONS FOR VENDOR SELECTION... 3 SECTION A: COMPANY PROFILE... 4 SECTION B: SCHEDULE &

### CONCEPTCLASSIFIER FOR SHAREPOINT

CONCEPTCLASSIFIER FOR SHAREPOINT PRODUCT OVERVIEW The only SharePoint 2007 and 2010 solution that delivers automatic conceptual metadata generation, auto-classification and powerful taxonomy tools running

### Optimization of Search Results with Duplicate Page Elimination using Usage Data A. K. Sharma 1, Neelam Duhan 2 1, 2

Optimization of Search Results with Duplicate Page Elimination using Usage Data A. K. Sharma 1, Neelam Duhan 2 1, 2 Department of Computer Engineering, YMCA University of Science & Technology, Faridabad,

### Sustaining Privacy Protection in Personalized Web Search with Temporal Behavior

Sustaining Privacy Protection in Personalized Web Search with Temporal Behavior N.Jagatheshwaran 1 R.Menaka 2 1 Final B.Tech (IT), jagatheshwaran.n@gmail.com, Velalar College of Engineering and Technology,

### LexisNexis Concordance Evolution Amazing speed plus LAW PreDiscovery and LexisNexis Near Dupe integration

LexisNexis Concordance Evolution Amazing speed plus LAW PreDiscovery and LexisNexis Near Dupe integration LexisNexis is committed to developing new and better Concordance Evolution capabilities. All based

### Data Mining Yelp Data - Predicting rating stars from review text

Data Mining Yelp Data - Predicting rating stars from review text Rakesh Chada Stony Brook University rchada@cs.stonybrook.edu Chetan Naik Stony Brook University cnaik@cs.stonybrook.edu ABSTRACT The majority

### HELP DESK SYSTEMS. Using CaseBased Reasoning

HELP DESK SYSTEMS Using CaseBased Reasoning Topics Covered Today What is Help-Desk? Components of HelpDesk Systems Types Of HelpDesk Systems Used Need for CBR in HelpDesk Systems GE Helpdesk using ReMind

### Automatic Web Page Classification

Automatic Web Page Classification Yasser Ganjisaffar 84802416 yganjisa@uci.edu 1 Introduction To facilitate user browsing of Web, some websites such as Yahoo! (http://dir.yahoo.com) and Open Directory

### An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

### IBM Policy Assessment and Compliance

IBM Policy Assessment and Compliance Powerful data governance based on deep data intelligence Highlights Manage data in-place according to information governance policy. Data topology map provides a clear

### Viewpoint ediscovery Services

Xerox Legal Services Viewpoint ediscovery Platform Technical Brief Viewpoint ediscovery Services Viewpoint by Xerox delivers a flexible approach to ediscovery designed to help you manage your litigation,

### Keywords: Regression testing, database applications, and impact analysis. Abstract. 1 Introduction

Regression Testing of Database Applications Bassel Daou, Ramzi A. Haraty, Nash at Mansour Lebanese American University P.O. Box 13-5053 Beirut, Lebanon Email: rharaty, nmansour@lau.edu.lb Keywords: Regression

### ediscovery Solution for Email Archiving

ediscovery Solution for Email Archiving www.sonasoft.com INTRODUCTION Enterprises reliance upon electronic communications continues to grow with increased amounts of information being shared via e mail.

### Predictive Coding Defensibility and the Transparent Predictive Coding Workflow

WHITE PAPER: PREDICTIVE CODING DEFENSIBILITY........................................ Predictive Coding Defensibility and the Transparent Predictive Coding Workflow Who should read this paper Predictive

### Observing Data Quality Service Level Agreements: Inspection, Monitoring and Tracking WHITE PAPER

Observing Data Quality Service Level Agreements: Inspection, Monitoring and Tracking WHITE PAPER SAS White Paper Table of Contents Introduction.... 1 DQ SLAs.... 2 Dimensions of Data Quality.... 3 Accuracy...

### Sentiment analysis on tweets in a financial domain

Sentiment analysis on tweets in a financial domain Jasmina Smailović 1,2, Miha Grčar 1, Martin Žnidaršič 1 1 Dept of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia 2 Jožef Stefan International

### Mining a Corpus of Job Ads

Mining a Corpus of Job Ads Workshop Strings and Structures Computational Biology & Linguistics Jürgen Jürgen Hermes Hermes Sprachliche Linguistic Data Informationsverarbeitung Processing Institut Department

### Amazing speed and easy to use designed for large-scale, complex litigation cases

Amazing speed and easy to use designed for large-scale, complex litigation cases LexisNexis is committed to developing new and better Concordance Evolution capabilities. All based on feedback from customers

### Facilitating Business Process Discovery using Email Analysis

Facilitating Business Process Discovery using Email Analysis Matin Mavaddat Matin.Mavaddat@live.uwe.ac.uk Stewart Green Stewart.Green Ian Beeson Ian.Beeson Jin Sa Jin.Sa Abstract Extracting business process

### Reduce Cost and Risk during Discovery E-DISCOVERY GLOSSARY

2016 CLM Annual Conference April 6-8, 2016 Orlando, FL Reduce Cost and Risk during Discovery E-DISCOVERY GLOSSARY Understanding e-discovery definitions and concepts is critical to working with vendors,

### Predictive Coding Defensibility and the Transparent Predictive Coding Workflow

Predictive Coding Defensibility and the Transparent Predictive Coding Workflow Who should read this paper Predictive coding is one of the most promising technologies to reduce the high cost of review by

### One Decision Document Review Accelerator. Orange Legal Technologies. OrangeLT.com Info@OrangeLT.com

One Decision Document Review Accelerator Orange Legal Technologies OrangeLT.com Info@OrangeLT.com By the Numbers: The Need for Technology in Attorney Review Seventy. Integrated near- duplicate detection

### A Deduplication-based Data Archiving System

2012 International Conference on Image, Vision and Computing (ICIVC 2012) IPCSIT vol. 50 (2012) (2012) IACSIT Press, Singapore DOI: 10.7763/IPCSIT.2012.V50.20 A Deduplication-based Data Archiving System

### Hadoop-based Open Source ediscovery: FreeEed. (Easy as popcorn)

+ Hadoop-based Open Source ediscovery: FreeEed (Easy as popcorn) + Hello! 2 Sujee Maniyam & Mark Kerzner Founders @ Elephant Scale consulting and training around Hadoop, Big Data technologies Enterprise

### COMPONENT TECHNOLOGIES FOR E-DISCOVERY AND PROTOTYPING OF SUIT-COPING SYSTEM

COMPONENT TECHNOLOGIES FOR E-DISCOVERY AND PROTOTYPING OF SUIT-COPING SYSTEM Youngsoo Kim, Dowon Hong Electronics & Telecommunications Research Institute (ETRI), Korea blitzkrieg@etri.re.kr, dwhong@etri.re.kr

### Observing Data Quality Service Level Agreements: Inspection, Monitoring, and Tracking

A DataFlux White Paper Prepared by: David Loshin Observing Data Quality Service Level Agreements: Inspection, Monitoring, and Tracking Leader in Data Quality and Data Integration www.dataflux.com 877 846

### AXIGEN Mail Server Reporting Service

AXIGEN Mail Server Reporting Service Usage and Configuration The article describes in full details how to properly configure and use the AXIGEN reporting service, as well as the steps for integrating it

### Sample Electronic Discovery Request for Proposal

[COMPANY LOGO] Sample Electronic Discovery Request for Proposal Table of Contents OVERVIEW... 3 IMPORTANT CONSIDERATIONS FOR VENDOR SELECTION... 3 SECTION A: COMPANY PROFILE... 4 SECTION B: SCHEDULE &

### A Business Process Services Portal

A Business Process Services Portal IBM Research Report RZ 3782 Cédric Favre 1, Zohar Feldman 3, Beat Gfeller 1, Thomas Gschwind 1, Jana Koehler 1, Jochen M. Küster 1, Oleksandr Maistrenko 1, Alexandru

### TIBCO Spotfire Metrics Modeler User s Guide. Software Release 6.0 November 2013

TIBCO Spotfire Metrics Modeler User s Guide Software Release 6.0 November 2013 Important Information SOME TIBCO SOFTWARE EMBEDS OR BUNDLES OTHER TIBCO SOFTWARE. USE OF SUCH EMBEDDED OR BUNDLED TIBCO SOFTWARE

### A Review of Anomaly Detection Techniques in Network Intrusion Detection System

A Review of Anomaly Detection Techniques in Network Intrusion Detection System Dr.D.V.S.S.Subrahmanyam Professor, Dept. of CSE, Sreyas Institute of Engineering & Technology, Hyderabad, India ABSTRACT:In

Pentaho Data Mining Copyright 2007 Pentaho Corporation. Redistribution permitted. All trademarks are the property of their respective owners. For the latest information, please visit our web site at www.pentaho.org

### Integrated email archiving: streamlining compliance and discovery through content and business process management

Make better decisions, faster March 2008 Integrated email archiving: streamlining compliance and discovery through content and business process management 2 Table of Contents Executive summary.........

### Email Spam Detection A Machine Learning Approach

Email Spam Detection A Machine Learning Approach Ge Song, Lauren Steimle ABSTRACT Machine learning is a branch of artificial intelligence concerned with the creation and study of systems that can learn

### Data Mining in Web Search Engine Optimization and User Assisted Rank Results

Data Mining in Web Search Engine Optimization and User Assisted Rank Results Minky Jindal Institute of Technology and Management Gurgaon 122017, Haryana, India Nisha kharb Institute of Technology and Management

### Data Mining for Knowledge Management. Classification

1 Data Mining for Knowledge Management Classification Themis Palpanas University of Trento http://disi.unitn.eu/~themis Data Mining for Knowledge Management 1 Thanks for slides to: Jiawei Han Eamonn Keogh

### Flattening Enterprise Knowledge

Flattening Enterprise Knowledge Do you Control Your Content or Does Your Content Control You? 1 Executive Summary: Enterprise Content Management (ECM) is a common buzz term and every IT manager knows it

### Discovery Assistant Near Duplicates

Discovery Assistant Near Duplicates ImageMAKER Discovery Assistant is an ediscovery processing software product designed to import and process electronic and scanned documents suitable for exporting to

### Automatic Identification of user goals in web search based on classification of click through results

Automatic Identification of user goals in web search based on classification of click through results Synopsis of the Thesis to be submitted for the Award of the Degree of Masters of Technology in Computer

### Web Usage Mining. discovery of user access (usage) patterns from Web logs What s the big deal? Build a better site: Know your visitors better:

Web Usage Mining Web Usage Mining Web Usage Mining discovery of user access (usage) patterns from Web logs What s the big deal? Build a better site: For everybody system improvement (caching & web design)

### Final Project Report

CPSC545 by Introduction to Data Mining Prof. Martin Schultz & Prof. Mark Gerstein Student Name: Yu Kor Hugo Lam Student ID : 904907866 Due Date : May 7, 2007 Introduction Final Project Report Pseudogenes

### Workflow Solutions for Very Large Workspaces

Workflow Solutions for Very Large Workspaces February 3, 2016 - Version 9 & 9.1 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

### ARCHIVING FOR EXCHANGE 2013

White Paper ARCHIVING FOR EXCHANGE 2013 A Comparison with EMC SourceOne Email Management Abstract Exchange 2013 is the latest release of Microsoft s flagship email application and as such promises to deliver

### Distributed Computing and Big Data: Hadoop and MapReduce

Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:

### Reducing E-Discovery Cost by Filtering Included Emails

Reducing E-Discovery Cost by Filtering Included Emails Tsuen-Wan Johnny Ngan Core Research Group Symantec Research Labs 900 Corporate Pointe Culver City, CA 90230 Abstract As businesses become more reliance

### 131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

1/10 131-1 Adding New Level in KDD to Make the Web Usage Mining More Efficient Mohammad Ala a AL_Hamami PHD Student, Lecturer m_ah_1@yahoocom Soukaena Hassan Hashem PHD Student, Lecturer soukaena_hassan@yahoocom

### ZL UNIFIED ARCHIVE A Project Manager s Guide to E-Discovery. ZL TECHNOLOGIES White Paper

ZL UNIFIED ARCHIVE A Project Manager s Guide to E-Discovery ZL TECHNOLOGIES White Paper PAGE 1 A project manager s guide to e-discovery In civil litigation, the parties in a dispute are required to provide

### Comparing Methods to Identify Defect Reports in a Change Management Database

Comparing Methods to Identify Defect Reports in a Change Management Database Elaine J. Weyuker, Thomas J. Ostrand AT&T Labs - Research 180 Park Avenue Florham Park, NJ 07932 (weyuker,ostrand)@research.att.com

### The Best Kept Secrets to Using Keyword Search Technologies

The Best Kept Secrets to Using Keyword Search Technologies By Philip Sykes and Richard Finkelman Part 1 Understanding the Search Engines A Comparison of dtsearch and Lucene Introduction Keyword searching

### CHAPTER 3 DATA MINING AND CLUSTERING

CHAPTER 3 DATA MINING AND CLUSTERING 3.1 Introduction Nowadays, large quantities of data are being accumulated. The amount of data collected is said to be almost doubled every 9 months. Seeking knowledge

### ERNW Newsletter 29 / November 2009

ERNW Newsletter 29 / November 2009 Dear Partners and Colleagues, Welcome to the ERNW Newsletter no. 29 covering the topic: Data Leakage Prevention A Practical Evaluation Version 1.0 from 19th of november

### Optimization of Search Results with De-Duplication of Web Pages In a Mobile Web Crawler

Optimization of Search Results with De-Duplication of Web Pages In a Mobile Web Crawler Monika 1, Mona 2, Prof. Ela Kumar 3 Department of Computer Science and Engineering, Indira Gandhi Delhi Technical

### Experiments in Web Page Classification for Semantic Web

Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address

### Dr. Antony Selvadoss Thanamani, Head & Associate Professor, Department of Computer Science, NGM College, Pollachi, India.

Enhanced Approach on Web Page Classification Using Machine Learning Technique S.Gowri Shanthi Research Scholar, Department of Computer Science, NGM College, Pollachi, India. Dr. Antony Selvadoss Thanamani,

### So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)

Internet Technology Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No #39 Search Engines and Web Crawler :: Part 2 So today we

### Bayesian Spam Filtering

Bayesian Spam Filtering Ahmed Obied Department of Computer Science University of Calgary amaobied@ucalgary.ca http://www.cpsc.ucalgary.ca/~amaobied Abstract. With the enormous amount of spam messages propagating

### Near Duplicate Document Detection Survey

Near Duplicate Document Detection Survey Bassma S. Alsulami, Maysoon F. Abulkhair, Fathy E. Eassa Faculty of Computing and Information Technology King AbdulAziz University Jeddah, Saudi Arabia Abstract

### Chapter 20: Data Analysis

Chapter 20: Data Analysis Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 20: Data Analysis Decision Support Systems Data Warehousing Data Mining Classification

### Classification and Prediction

Classification and Prediction Slides for Data Mining: Concepts and Techniques Chapter 7 Jiawei Han and Micheline Kamber Intelligent Database Systems Research Lab School of Computing Science Simon Fraser

### Term extraction for user profiling: evaluation by the user

Term extraction for user profiling: evaluation by the user Suzan Verberne 1, Maya Sappelli 1,2, Wessel Kraaij 1,2 1 Institute for Computing and Information Sciences, Radboud University Nijmegen 2 TNO,

### Enhancing Document Review Efficiency with OmniX

Xerox Litigation Services OmniX Platform Review Technical Brief Enhancing Document Review Efficiency with OmniX Xerox Litigation Services delivers a flexible suite of end-to-end technology-driven services,

### Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be

### Natural Language to Relational Query by Using Parsing Compiler

Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,

### Information Retrieval for E-Discovery Douglas W. Oard

Information Retrieval for E-Discovery Douglas W. Oard College of Information Studies and Institute for Advanced Computer Studies University of Maryland, College Park Joint work with: Mossaab Bagdouri,

ORACLE BUSINESS INTELLIGENCE, ORACLE DATABASE, AND EXADATA INTEGRATION EXECUTIVE SUMMARY Oracle business intelligence solutions are complete, open, and integrated. Key components of Oracle business intelligence

### E- Discovery in Criminal Law

E- Discovery in Criminal Law ! An e-discovery Solution for the Criminal Context Criminal lawyers often lack formal procedures to guide them through preservation, collection and analysis of electronically

### THIS WEBCAST WILL BEGIN SHORTLY

If you have any technical problems with the Webcast or the streaming audio, please contact us via email at: webcast@acc.com Thank You! THIS WEBCAST WILL BEGIN SHORTLY Cloud-Based vs. On-Premise ediscovery

### Knowledge Discovery from patents using KMX Text Analytics

Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs anton.heijs@treparel.com Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers

### Technology Assisted Review: Don t Worry About the Software, Keep Your Eye on the Process

Technology Assisted Review: Don t Worry About the Software, Keep Your Eye on the Process By Joe Utsler, BlueStar Case Solutions Technology Assisted Review (TAR) has become accepted widely in the world