New Developments in the Automatic Classification of Email Records. Inge Alberts, André Vellino, Craig Eby, Yves Marleau



New Developments in the Automatic Classification of Email Records
Inge Alberts, André Vellino, Craig Eby, Yves Marleau
ARMA Canada 2014

INTRODUCTION

OUTLINE
1. Research team
2. Research context / problem statement
3. Overview of records auto-classification
4. Project objectives
5. Research methodology
6. Qualitative analysis
7. Automatic classification
8. Future work

RESEARCH TEAM
CISRI: a centre of excellence and a catalyst for collaborative, interdisciplinary research in information science.
ÉSIS, University of Ottawa: an information studies program grounded in theory, supported by practical work experience, and integrally connected to the trends of the leading knowledge centres in the National Capital Region and beyond.

RESEARCH TEAM
- Inge Alberts (ÉSIS / CISRI)
- André Vellino (ÉSIS / Institute for Science, Society and Policy (ISSP))
- Craig Eby (CISRI / Cogniva)
- Yves Marleau (CISRI / Cogniva)

BUSINESS PROBLEMS
Email management has always been a problem:
- Terabytes of data in mailboxes and PST files
- Integration of email and ECM systems
- What is the role of the user?

GOC EXAMPLE
Three new initiatives make the problem even more complex:
- Email Transformation
- Directive on Recordkeeping
- Open Government

IMPORTANCE OF BUSINESS CONTEXT
- Need to identify information of business value
- Need to better define the concept of business value
- Need to situate information within its context of use

DEFINING BUSINESS CONTEXT

RESEARCH PROJECT
File plan diagram: business functions (1000, 2000), each broken into sub-functions (1000-001, 1000-002; 2000-001, 2000-002).

- The ISIS Methodology produces models of an organization's business context
- The ISIS Enterprise Software Solution helps organizations implement business-context-centric classification: classification automation, automated business rules, centralized taxonomy and rules management
- The goal is to further reduce the requirements on users

OVERVIEW OF RECORDS AUTO-CLASSIFICATION
Current state:
- Reaching classification quality similar to that of human users
- Mix of statistical and rule-based implementations
Challenges:
- Challenging to implement
- Confidence in and acceptance of results
- No structured approach to systematically ensure the quality of the system and interpret results

RELATED RESEARCH
Analysis of email:
- Ph.D. on automatic classification (Inge)
- Extrusion protection at Entrust (André)
- Business value pilot (André & Inge)
Business modeling:
- ISIS Methodology (Cogniva)
- Collaborations with LAC (Cogniva)
- Research on faceted classification (CISRI & University of Montreal)
Process discovery & auto-classification:
- Auto-classification research (Cogniva)
- IRAP research (Cogniva & CISRI)
- ISIS Software (Cogniva)

RESEARCH OBJECTIVES
1. Understand how the concept of business value applies to the management of email records
2. Develop a model of information experts' strategies while appraising email value in a work context
3. Propose a set of requirements to automatically classify organizational email records
4. Test these requirements on a corpus of emails

RESEARCH METHODOLOGY
Phase 1: User study. Qualitative analysis of 8 information experts' appraisal strategies.
Phase 2: Automatic classification. Quantitative analysis of ~900 email messages.

Research Focus Phase 1
- Criteria to identify the business value of email
- Decision process when appraising the value of email
- Lexical and nonlexical features used when appraising the business value of email
- Requirements needed to automatically classify organizational email

Methodology Phase 1
- Semi-structured interviews: 8 experts, ~1 h each, 14 questions on business value & email
- Cognitive inquiries: 7 experts, ~30 minutes each, email classification exercise on an email sample (n=174)
- Model development: email business value factors, email feature analysis, classification model of appraisal strategies
- Manual classification: 2 email inboxes (n=1975), 1 corpus classified (n≈800)

PROFILE OF PARTICIPANTS (N=8)

RESULTS FROM PHASE 1
1. Email Business Value Factors
2. Email Features Analysis
3. Classification Model of Appraisal Strategies

BUSINESS VALUE
Information resources of business value:
- Are published and unpublished materials, regardless of medium or form
- Are created or acquired because they enable and document decision-making in support of programs, services and ongoing operations
- Support departmental reporting, performance and accountability requirements
(Directive on Recordkeeping)

BUSINESS VALUE
Diagram: business value in its process context, evolving over time.
- Operational = performance: supports actions & decisions, enhances performance, mitigates risks
- Evidential = accountability: evidence of transactions, reporting on results, ATIP, litigation

EMAIL BUSINESS VALUE FACTORS
1. Origin
2. Action
3. Chronology
4. Meaning

EMAIL ORIGIN
- Email origin is internal (team members, supervisors) or external (clients, professional network)
- Origin is the main factor affecting the appraisal of business value
- Appraisal decisions related to origin are based on: the name of the sender; the position and organization of the sender; the hierarchical relation between the sender and the recipient; an active project involving the sender and the recipient

EMAIL ACTION
- Email action is passive (no engagement from the recipient) or performative (engagement & accountability from the recipient)
- Action is an important factor affecting the appraisal of business value, both operational and evidential
- Appraisal decisions related to email action are based on: the type of action; the level of engagement and accountability of the recipient (high risk or low risk)

EMAIL CHRONOLOGY
- Chronology is operational (during the project) or postmortem (after the project)
- For IM consultants, keeping track of action history is a determining factor in appraising business value, especially for active projects
- Chronology is a challenging factor for defining business value
- Appraisal decisions related to email chronology are based on project status: active or closed

EMAIL MEANING
- Email meaning is explicit (rich vocabulary) or latent (based on context)
- Many solutions available on the market classify email based on explicit meaning, but appraisal decisions are often based on latent meaning
- Appraisal decisions related to email meaning are based on:
  Explicit: keywords, attachment, thread, type of action
  Implicit: origin, chronology, level of engagement

ANALYSIS OF EMAIL FEATURES
Lexical features:
- Name & organization of the sender
- Action verbs: approval, confirmation, request, reminder, negotiation
- Action objects: SOW, meeting, status report, deadlines, deliverables, decision, reference material
- Presence of RE or FW in the title
- Name of attachment
Nonlexical features:
- Message sent or received
- Hierarchical relation between the sender & the recipient
- Position of the recipient (TO, CC)
- Number of recipients (TO, CC)
- Project status: active or closed
- Presence of attachment
- Presence of a thread
- Presence of a high-priority symbol
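Several of these features can be pulled directly from message headers. As an illustration only (not the project's actual code), here is a minimal Python sketch using the standard-library email parser; the field names in the returned dictionary are hypothetical:

```python
from email import message_from_string
from email.utils import getaddresses

# Hypothetical sketch: extract a few of the lexical and nonlexical
# features listed on the slide from a raw RFC 822 message.
def extract_features(raw: str) -> dict:
    msg = message_from_string(raw)
    subject = msg.get("Subject", "")
    to_addrs = [a for _, a in getaddresses(msg.get_all("To", []))]
    cc_addrs = [a for _, a in getaddresses(msg.get_all("Cc", []))]
    return {
        "sender": msg.get("From", ""),
        # Presence of RE or FW in the title
        "is_reply_or_forward": subject.upper().startswith(("RE:", "FW:", "FWD:")),
        # Number and position of recipients (TO, CC)
        "n_to": len(to_addrs),
        "n_cc": len(cc_addrs),
        # Presence of a high-priority flag
        "high_priority": msg.get("X-Priority", "") in ("1", "2"),
        # Presence of an attachment (any part with a filename)
        "has_attachment": any(part.get_filename() for part in msg.walk())
                          if msg.is_multipart() else False,
    }

raw = ("From: alice@example.org\n"
       "To: bob@example.org, carol@example.org\n"
       "Subject: RE: status report\n\n"
       "Please confirm the deadline.")
print(extract_features(raw))
```

Features that depend on context rather than headers (hierarchical relation, project status) would need an external organizational model, as the slides note.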

HUMAN CLASSIFICATION MODEL (1/2)

HUMAN CLASSIFICATION MODEL (2/2)

MANUAL CLASSIFICATION CHALLENGES (1/2)
- More BV=No than BV=Yes
- Bilingual messages
- Two recipients in the TO field
- Threads: a message of business value is quickly superseded by a more recent one
- The sender is accountable for internal messages sent, but for both sent and received messages when external

MANUAL CLASSIFICATION CHALLENGES (2/2)
- Emails of business value and attachments of business value have to be differentiated
- Perception of value differs between individuals and the organization
- Evaluating the importance of some decisions or major revisions of drafts can be challenging
- Some emails of business value during active projects are ephemeral (operational versus evidential)

Objectives Phase 2
- Attempt the automation of a binary classification: Business Value / No Business Value
- Compare the human labeling process with machine learning

Methodology Phase 2
Corpus creation → Manual classification → Feature extraction → Machine training → Cross-validation → Model testing
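The machine training and cross-validation stages of this pipeline can be sketched in plain Python. The "learner" below is a trivial majority-class stand-in (an assumption for illustration, not the SVM actually used in the study):

```python
import random

# Minimal k-fold cross-validation sketch for the pipeline above.
# examples: list of (text, label) pairs; the trivial "learner" just
# predicts the most frequent label seen in the training split.
def k_fold_cv(examples, k=5, seed=0):
    data = examples[:]
    random.Random(seed).shuffle(data)       # shuffle before splitting
    folds = [data[i::k] for i in range(k)]  # k roughly equal folds
    scores = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        labels = [y for _, y in train]
        majority = max(set(labels), key=labels.count)   # "training"
        correct = sum(1 for _, y in test if y == majority)
        scores.append(correct / len(test))
    return sum(scores) / k                  # mean held-out accuracy

# A balanced toy corpus shaped like the training data (250 BV + 250 no-BV).
corpus = [("text", "BV")] * 250 + [("text", "NBV")] * 250
print(round(k_fold_cv(corpus), 2))
```

With a balanced corpus and a majority-class baseline, the cross-validated accuracy hovers around 0.5, which is exactly why the SVM results on the later slides (0.91) are meaningful.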

A machine learning toolkit for non-experts, for experimenting with text mining technology:
- Based on the Weka data mining open-source software
- Developed by smart Ph.D. students at Carnegie Mellon University
- Offers feature extraction, model building, automated analysis and labeling, and prediction

EMAIL CORPUS FOR TRAINING
2 individual collections (inbox + outbox):
- 250 emails Business Value + 250 emails No Business Value
- 172 emails Business Value + 172 emails No Business Value
Features extracted:
- Originator and recipients (To / From / Cc)
- Content of subject / body / attachments
- Number of recipients in the To and Cc fields
- Number of attachments
- Importance flag
- Forwarded indications
- Part-of-thread indications

FROM AND TO FIELDS "From sender supervisor colleague client To solerecipient "soleorganizationalrecipient "supervisorisrecipient "oneamongmanyrecipient clientisrecipient" in every other case 36

GENERAL RESULTS
Business Value SVM: accuracy 0.91, kappa 0.83 (compare with a spam SVM: accuracy 0.96, kappa 0.93)
- Support vector machines (SVMs) are highly accurate predictors of Business Value / No Business Value
- SVM models are very specific to sender / recipient
- One model does not appear to suffice for organization-wide automatic classification of business value
- Training SVMs for greater accuracy makes it more difficult to explain the behaviour of the model

CONTRIBUTION OF ATTACHMENT CONTENT
Comparison of results without and with attachment content analysis.

No Business Value features (anonymized): company acronym, company, colleague, colleague, colleague

Business Value features (anonymized): email owner, client acronym, client, client acronym

FUTURE WORK
- Experiment with alternative classifiers besides SVMs
- Grow the corpus of BV and NBV emails from a wider variety of senders / recipients
- Add / subtract email features
- Vary text analysis parameters: unigrams / bigrams / trigrams, PoS tagging, stop words, stemming, punctuation

ACKNOWLEDGMENTS
Special thanks to:
- The participants, for their time and their enthusiasm during this study
- The organization that granted us permission to analyze email data
- The research assistants, for their active contribution
This project is supported by a research grant from the University of Ottawa.

THANK YOU

ACCURACY
Accuracy = (true positives + true negatives) / ((true + false) positives + (true + false) negatives)
= (229 + 228) / 500 = 457 / 500 = 0.914 ≈ 0.91
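The same computation, checked in Python with the counts from the slide:

```python
# Accuracy from the slide's counts: 229 true positives and 228 true
# negatives out of 500 classified messages (the FP/FN split of the
# remaining 43 errors is not given, and is not needed for accuracy).
tp, tn = 229, 228
total = 500
accuracy = (tp + tn) / total
print(round(accuracy, 3))  # 0.914
```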

Cohen's Kappa
The degree to which the machine classifier and the human classifiers agree:
κ = (Pr(M) - Pr(R)) / (1 - Pr(R))
Pr(M) = observed agreement = (229 + 228) / 500 = 0.914
Pr(R) = probability of random agreement = 0.5
κ = (0.914 - 0.5) / (1 - 0.5) ≈ 0.83
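And the kappa computation with the same counts; using the unrounded observed agreement Pr(M) = 0.914 gives κ ≈ 0.83, matching the value reported on the general results slide:

```python
# Cohen's kappa: observed agreement corrected for chance agreement
# (0.5 for a balanced binary task with a random rater).
pr_m = (229 + 228) / 500   # observed agreement, 0.914
pr_r = 0.5                 # probability of random agreement
kappa = (pr_m - pr_r) / (1 - pr_r)
print(round(kappa, 2))  # 0.83
```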