New Developments in the Automatic Classification of Email Records
Inge Alberts, André Vellino, Craig Eby, Yves Marleau
ARMA Canada 2014
INTRODUCTION
OUTLINE
1. Research team
2. Research context / Problem statement
3. Overview of records auto-classification
4. Project objectives
5. Research methodology
6. Qualitative analysis
7. Automatic classification
8. Future work
RESEARCH TEAM
CISRI: Center of excellence and a catalyst for collaborative, interdisciplinary research in information science
ÉSIS, University of Ottawa: Information Studies program grounded in theory, supported by practical work experience, and integrally connected to the trends of the leading knowledge centres in the National Capital Region and beyond
RESEARCH TEAM
- Inge Alberts: ÉSIS / CISRI
- André Vellino: ÉSIS / Institute for Science, Society and Policy (ISSP)
- Craig Eby: CISRI / Cogniva
- Yves Marleau: CISRI / Cogniva
BUSINESS PROBLEMS
- Email management has always been a problem
- Terabytes of data in mailboxes and PST files
- Integration of email and ECM systems
- What is the role of the user?
GOC EXAMPLE
Three new initiatives make the problem even more complex:
- Email Transformation
- Directive on Recordkeeping
- Open Government
IMPORTANCE OF BUSINESS CONTEXT
- Need to identify information of business value
- Need to better define the concept of business value
- Need to situate information within its context of use
DEFINING BUSINESS CONTEXT
RESEARCH PROJECT
[Diagram: file plan with business functions 1000 and 2000, each divided into sub-functions (1000-001, 1000-002, 2000-001, 2000-002)]
ISIS
- The ISIS Methodology produces models of an organization's business context
- The ISIS Enterprise Software Solution helps organizations implement business-context-centric classification: classification automation, automated business rules, centralized taxonomy and rules management
- The goal is to further reduce the requirements on users
OVERVIEW OF RECORDS AUTO-CLASSIFICATION
Current state:
- Reaching classification quality similar to human users
- Mix of statistical and rule-based implementations
Challenges:
- Challenging to implement
- Confidence in and acceptance of results
- Lack of a structured approach to systematically ensure system quality and interpret results
RELATED RESEARCH
Analysis of Email:
- Ph.D. on automatic classification (Inge)
- Extrusion protection at Entrust (André)
- Business value pilot (André & Inge)
Business Modeling:
- ISIS Methodology (Cogniva)
- Collaborations with LAC (Cogniva)
- Research on faceted classification (CISRI & University of Montreal)
Process Discovery & Auto-Classification:
- Auto-classification research (Cogniva)
- IRAP research (Cogniva & CISRI)
- ISIS Software (Cogniva)
RESEARCH OBJECTIVES
1. Understand how the concept of business value applies to the management of email records
2. Develop a model of information experts' strategies while appraising email value in a work context
3. Propose a set of requirements to automatically classify organizational email records
4. Test these requirements on a corpus of emails
RESEARCH METHODOLOGY
Phase 1 (User Study): qualitative analysis of 8 information experts' appraisal strategies
Phase 2 (Automatic Classification): quantitative analysis of ~900 email messages
Research Focus: Phase 1
- Criteria to identify the business value of email
- Decision process when appraising the value of email
- Lexical and nonlexical features used when appraising the business value of email
- Requirements needed to automatically classify organizational email
Methodology: Phase 1
- Semi-structured interviews: 8 experts, ~1h, 14 questions on business value & email
- Cognitive inquiries: 7 experts, ~30 minutes, email classification exercise, email sample (n=174)
- Manual classification: 2 email inboxes (n=1975), 1 corpus classified (n=~800)
- Model development: email BV factors, email feature analysis, classification model of appraisal strategies
PROFILE OF PARTICIPANTS (N=8)
RESULTS FROM PHASE 1
1. Email Business Value Factors
2. Email Features Analysis
3. Classification Model of Appraisal Strategies
BUSINESS VALUE
Information resources of business value:
- Are published and unpublished materials, regardless of medium or form
- Are created or acquired because they enable and document decision-making in support of programs, services and ongoing operations
- Support departmental reporting, performance and accountability requirements
(Directive on Recordkeeping)
BUSINESS VALUE
[Diagram: business value across process context and time]
- Operational = Performance: support actions & decisions, enhance performance, mitigate risks
- Evidential = Accountability: evidence of transaction, report on results, ATIP, litigation
EMAIL BUSINESS VALUE FACTORS
1. Origin
2. Action
3. Chronology
4. Meaning
EMAIL ORIGIN
- Email origin is internal (team members, supervisors) or external (clients, professional network)
- Origin is the main factor affecting the appraisal of business value
- Appraisal decisions related to origin are based on:
  - Name of the sender
  - Position and organization of the sender
  - Hierarchical relation between the sender & the recipient
  - Active project involving the sender & the recipient
EMAIL ACTION
- Email action is passive (no engagement from the recipient) or performative (engagement & accountability from the recipient)
- Action is an important factor affecting the appraisal of business value, both operational & evidential
- Appraisal decisions related to email action are based on:
  - Type of action
  - Level of engagement & accountability of the recipient (high risk or low risk)
EMAIL CHRONOLOGY
- Chronology is operational (during the project) or postmortem (after the project)
- For IM consultants, keeping track of action history is a determining factor when appraising business value, especially for active projects
- Chronology is a challenging factor for defining business value
- Appraisal decisions related to email chronology are based on:
  - Project status: active or closed
EMAIL MEANING
- Email meaning is explicit (rich vocabulary) or latent (based on context)
- Many solutions available on the market classify email based on explicit meaning, but appraisal decisions are often based on latent meaning
- Appraisal decisions related to email meaning are based on:
  - Explicit: keywords, attachment, thread, type of action
  - Latent: origin, chronology, level of engagement
ANALYSIS OF EMAIL FEATURES
Lexical features:
- Name & organization of the sender
- Action verbs: approval, confirmation, request, reminder, negotiation
- Action objects: SOW, meeting, status report, deadlines, deliverables, decision, reference material
- Presence of RE or FW in the title
- Name of attachment
Nonlexical features:
- Message sent or received
- Hierarchical relation between the sender & the recipient
- Position of the recipient (TO, CC)
- Number of recipients (TO, CC)
- Project status: active or closed
- Presence of attachment
- Presence of a thread
- Presence of a high-priority symbol
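Several of these nonlexical features can be read directly from message headers. As a hedged sketch only (the project used its own tooling; the sample message and the `known_supervisors` set here are invented for illustration), Python's standard `email` module can extract them:

```python
# Illustrative sketch: pulling a few nonlexical features from a raw
# RFC 822 message. The message text and supervisor list are invented.
from email import message_from_string
from email.utils import getaddresses

RAW = """From: alice@example.org
To: bob@example.org
Cc: carol@example.org, dave@example.org
Subject: RE: Status report
X-Priority: 1

Please see the attached status report.
"""

def nonlexical_features(raw, known_supervisors=frozenset()):
    msg = message_from_string(raw)
    to = getaddresses(msg.get_all("To", []))
    cc = getaddresses(msg.get_all("Cc", []))
    sender = getaddresses(msg.get_all("From", []))
    subject = msg.get("Subject", "")
    return {
        "num_to": len(to),                         # recipients in TO
        "num_cc": len(cc),                         # recipients in CC
        "sole_recipient": len(to) == 1 and not cc, # position of recipient
        "in_thread": subject.upper().startswith(("RE:", "FW:", "FWD:")),
        "high_priority": msg.get("X-Priority", "") == "1",
        "sender_is_supervisor": any(a in known_supervisors for _, a in sender),
    }

features = nonlexical_features(RAW)
```

Project status and sent/received direction are not in the headers themselves; they would need external context, which is one reason the business-modeling side of the project matters.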
HUMAN CLASSIFICATION MODEL (1/2)
HUMAN CLASSIFICATION MODEL (2/2)
MANUAL CLASSIFICATION CHALLENGES (1/2)
- More BV=No than BV=Yes messages
- Bilingual messages
- Two recipients in the TO field
- Threads: a message of business value is quickly superseded by a more recent one
- The sender is accountable for internal messages sent, but for both sent and received messages when external
MANUAL CLASSIFICATION CHALLENGES (2/2)
- Emails of business value and attachments of business value have to be differentiated
- Perception of value differs between individuals and the organization
- Evaluating the importance of some decisions or major revisions of drafts can be challenging
- Some emails of business value during active projects are ephemeral (operational versus evidential)
Objectives: Phase 2
- Attempt the automation of binary classification: Business Value / No Business Value
- Compare the human labeling process with machine learning
Methodology: Phase 2
Corpus Creation → Manual Classification → Feature Extraction → Machine Training → Cross-Validation → Model Testing
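The cross-validation step in this pipeline can be sketched in a few lines. This is an illustrative stand-in only (the project used a Weka-based toolkit), showing how a corpus of n labeled messages is split into k folds so that each message is tested exactly once:

```python
# Sketch of k-fold splitting: each index lands in exactly one test fold,
# and the remaining folds form the training set for that round.
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

# 12 labeled emails, 3 folds
splits = list(k_fold_indices(12, 3))
```

Averaging the model's accuracy across the k test folds gives a less optimistic estimate than testing on the training data itself.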
MACHINE LEARNING TOOLKIT
- A machine learning toolkit for non-experts, for experimenting with text mining technology
- Based on the Weka Data Mining open source software
- Developed by smart Ph.D. students at Carnegie Mellon University
- Offers feature extraction, model building, automated analysis and labeling, and prediction
EMAIL CORPUS FOR TRAINING
2 individual collections (inbox + outbox):
- 250 emails Business Value + 250 emails No Business Value
- 172 emails Business Value + 172 emails No Business Value
Features extracted:
- Originator and recipients (To / From / Cc)
- Content of subject / body / attachments
- Number of recipients in To and Cc fields
- Number of attachments
- Importance flag
- Forwarded indications
- Part-of-thread indications
FROM AND TO FIELDS
From field features: sender is a supervisor, a colleague, or a client
To field features: "sole recipient", "sole organizational recipient", "supervisor is recipient", "one among many recipients", "client is recipient", or a default value in every other case
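A hedged reconstruction of how such categorical From/To features might be computed. The role table, the addresses, and the precedence among the categories are assumptions, since the slide lists only the feature values:

```python
# Invented role lookup standing in for the organizational data the
# project would have used.
ROLES = {
    "sue@example.org": "supervisor",
    "carl@example.org": "colleague",
    "kim@client.example": "client",
}

def from_feature(sender):
    """Categorical feature for the From field."""
    return ROLES.get(sender, "other")

def to_feature(to, cc, owner_domain="example.org"):
    """Categorical feature for the To/Cc fields."""
    recipients = list(to) + list(cc)
    if len(recipients) == 1:
        addr = recipients[0]
        if ROLES.get(addr) == "supervisor":
            return "supervisor_is_recipient"
        if ROLES.get(addr) == "client":
            return "client_is_recipient"
        if addr.endswith("@" + owner_domain):
            return "sole_organizational_recipient"
        return "sole_recipient"
    return "one_among_many_recipients"
```

Collapsing addresses into a handful of role categories like this is what lets a classifier generalize past the specific sender/recipient pairs seen in training.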
GENERAL RESULTS
Business Value SVM: Accuracy 0.91, Kappa 0.83 (compare with Spam SVM: Accuracy 0.96, Kappa 0.93)
- Support Vector Machines (SVMs) are highly accurate predictors of Business Value / No Business Value
- SVM models are very specific to the sender / recipient
- One model does not appear to suffice for organization-wide automatic classification of business value
- Training SVMs for greater accuracy makes it more difficult to explain the behaviour of the model
CONTRIBUTION OF ATTACHMENT CONTENT
[Charts: results without attachment content analysis versus with attachment content analysis]
[Chart: top features for No Business Value messages: company acronym, company name, colleague names]
[Chart: top features for Business Value messages: email owner, client acronym, client names]
FUTURE WORK
- Experiment with alternative classifiers besides SVMs
- Grow the corpus of BV and NBV emails from a wider variety of senders / recipients
- Add / subtract email features
- Vary text analysis parameters: unigrams / bigrams / trigrams, PoS tagging, stop words, stemming, punctuation
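The unigram/bigram/trigram and stop-word parameters listed above can be illustrated with a minimal tokenizer; the stop-word list here is a tiny invented stand-in for a real one:

```python
# Minimal n-gram extraction: lowercase, drop stop words, then emit
# sequences of n adjacent tokens. STOP_WORDS is illustrative only.
STOP_WORDS = {"the", "a", "of", "to", "is", "and"}

def ngrams(text, n, stop_words=STOP_WORDS):
    """Return the list of n-grams over the filtered tokens of text."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

subject = "Approval of the statement of work"
unigrams = ngrams(subject, 1)  # ["approval", "statement", "work"]
bigrams = ngrams(subject, 2)   # ["approval statement", "statement work"]
```

Varying n trades vocabulary size against context: bigrams and trigrams capture action phrases like "status report" that unigrams split apart, at the cost of a much sparser feature space.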
ACKNOWLEDGMENTS
Special thanks to:
- The participants, for their time and their enthusiasm during this study
- The organization that granted us permission to analyze email data
- The research assistants, for their active contribution
This project is supported by a research grant from the University of Ottawa
THANK YOU
ACCURACY
Accuracy = (True Pos. + True Neg.) / ((True + False) Pos. + (True + False) Neg.)
= (229 + 228) / 500 = 457 / 500 = 0.91
COHEN'S KAPPA
The degree to which the machine classifier and the human classifiers agree:
κ = (Pr(M) - Pr(R)) / (1 - Pr(R))
Pr(M) = observed agreement = (229 + 228) / 500 = 0.914
Pr(R) = probability of random agreement = 0.5
κ = (0.914 - 0.5) / (1 - 0.5) = 0.83
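The accuracy and kappa calculations on the two preceding slides, worked through in code from the reported confusion counts (TP=229, TN=228, N=500):

```python
# Worked versions of the two evaluation metrics used above.
def accuracy(tp, tn, total):
    """Fraction of messages the classifier labels correctly."""
    return (tp + tn) / total

def cohens_kappa(p_observed, p_chance):
    """Agreement beyond what random labeling would produce."""
    return (p_observed - p_chance) / (1 - p_chance)

p_m = accuracy(229, 228, 500)   # observed agreement = 0.914
kappa = cohens_kappa(p_m, 0.5)  # 0.828, reported as 0.83
```

With a balanced two-class corpus, chance agreement is 0.5, so kappa rescales accuracy onto a 0-to-1 range where 0 means no better than guessing.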