Auto Classification and the Holy Grail for Records Managers Doug Magnuson Information Lifecycle Governance Solution Leader for North America 1
Doug Magnuson
Information Lifecycle Governance Solutions
dmagnuson@us.ibm.com (503) 578-2378

Focus since 1997
* IBM Information Lifecycle Governance Solutions Executive
* Huron Consulting Group Managing Director: Strategic consulting to the CLO of Fortune 500 companies in the area of Information Governance, eDiscovery, Risk, and Compliance.
* Carefree Technologies President: Developed and implemented software applications for Enterprise Content Management solutions.

Education
* BS Industrial Engineering

Doug has nearly 30 years of experience in business process design, change management, and systems improvement. For the last 12 years he has focused on Enterprise Content, Document, Email, Records Management, and eDiscovery systems. Doug has assisted with the development of business plans and software specifications, and has provided guidance through software selection processes. He has provided oversight for implementation, process improvement and system conversion efforts. Doug is a frequent speaker on topics related to Enterprise Content Management.

Representative examples of Doug's experience:
* Developed information management strategies.
* Architected solutions, including the development of frameworks and reference models describing system interaction.
* Prepared business plans, process improvements, software specifications, and system requirements for content, document, and records management systems.
* Provided guidance through software and vendor selection processes, including development of detailed RFPs, business use cases, and demonstration scripts.
* Conducted implementation and conversion oversight, ensuring all critical business needs are addressed.
* Advised major information management vendors in the design, development, and improvement of their electronic content management software.
Auto Classification and the Holy Grail for Records Managers
Session Plan
Using the IBM Watson Jeopardy! Challenge to help tell the story of the advancements the industry (not just IBM) is making in Natural Language Processing.
Learning Goals
* Understand the basic concepts of Natural Language Processing and Content Analytics and how they support the needs of records managers.
* Apply these unprecedented capabilities for records managers: first, to understand authoritatively and quickly what their organization is currently retaining; and second, to have confidence that temporary and transient items can be identified and disposed of.
* Be a leader. By attending this session you will learn about the business case, the operational benefits, and the compelling need for using content analytics in your organization.
Agenda Obstacles to Managing the Information that Matters Best Practice Readiness Leveraging Content Analytics for Records and Info Management Summary 4
The Information Flood will Continue to Challenge Governance Processes 90% of the information in the world was created in the last 2 years. 44x The additional amount of information that will exist in the universe by 2020. 5
Very Simple Savings Proposition: Dispose of Unnecessary Data Enterprise Information 6
Transform Traditional Practices with New Outcomes
Traditional Emphasis: Records Retention. High Value Shift: Defensible Disposal.
Retention for legal and regulatory duties and business value is necessary but not sufficient in the economic climate. Disposal of unnecessary data reduces legal and IT costs, and aligns information costs with information value, consistent with IT and business objectives to contain costs.
Traditional Emphasis: Policy Publication. High Value Shift: Instrumentation.
Instrumenting retention, holds and disposal policy execution on application data and unstructured data ensures compliance and enables efficient, consistent disposal of unnecessary information to eliminate run rate costs immediately and sustainably.
Traditional Emphasis: Risk Monitoring. High Value Shift: Cost Take Out.
Reframing our information governance objectives to not only reduce risk but also improve information economics can contribute significant savings to our IT cost reduction objectives by enabling systematic disposal of unnecessary data and the ability to recover assets rapidly.
The Economic Benefits of Defensible Disposal are Compelling We could spend $35m less next year and lower our run rate We could lower run rate $3m now and spend $24m less over 3 years We could free up $150m to drive revenue and profit 8
Where Content Analytics Can Help
* On April 5, 2010, twenty-nine miners were killed in a terrible accident.
* This same mine was fined a total of $382,000 for "serious" unrepentant violations.
* In the previous month, the authorities cited the mine for 57 safety infractions.
* The mine received two citations the day before the explosion and in the last five years has been cited for 1,342 safety violations.
Could improved records and information visibility have prevented this accident?
Information Growth Outpaces Human Capacity
There are not enough humans to deal with the problem. Even if there were enough people to deal with the problem:
* Humans make mistakes and misinterpret meaning.
* Manual action is costly.
* Humans are inconsistent and choose to opt out.
* According to a GAO report [1], agency policies on preserving email records are not followed consistently.
* In a recent Cohasset study, manual classification costs 17 cents per document; for an organization with 25M documents this would represent a cost of about $4.25M.
* Typically less than 25% of users actually declare records.
* In a 2005 Department of the Navy study, only 12.5% of documents were classified with exact accuracy at least 75% of the time.
[1] GAO Report: Federal Records: Agencies Face Challenges in Managing E-Mail, Apr 2008
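As a quick check on the Cohasset figure above, the arithmetic can be sketched in a few lines. The function name is illustrative only; the per-document cost and document count are the slide's own numbers, and working in whole cents is a choice made here to avoid floating-point rounding.

```python
def manual_classification_cost(documents: int, cents_per_doc: int) -> float:
    """Return the total cost in dollars of classifying every document by hand."""
    return documents * cents_per_doc / 100


# Figures from the slide: 25M documents at 17 cents each.
cost = manual_classification_cost(documents=25_000_000, cents_per_doc=17)
print(f"${cost:,.0f}")
```

At 17 cents per document, 25M documents come to $4.25M, and that is for a single pass over a corpus that keeps growing.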
The day of reckoning is here
* Keeping everything forever drives unsustainable costs; routine disposal of information drives significant savings.
* Traditional approaches do not work; more human beings is not the answer.
* Content decommissioning and routine disposal have an immediate and ongoing impact on the IT budget.
* Increased spending without routine disposal eventually consumes the IT budget.
Principle #1: Data Growth Outpaces Storage Budgets and Business Processes Run rate costs double quickly if volume grows >30% Consumes CIO budget Storage: Direct Procurement Costs in Millions Information volume overwhelms information governance processes Undermines their effectiveness Governance processes have not matured to reflect volume, specifically how to: Define and execute legal holds and data collection Apply retention schedules to electronic information Align storage and manage information based on specific legal obligations and business value Provision, decommission and dispose of data High Risks & Mitigation Burden This leads to excess data and cost as well as operational challenges that in turn contribute to risk: Difficulty disposing of unnecessary data Complexity in applying legal holds Inefficiencies in data management and governance 16 governance processes impacted by high data volume such as placing holds, collecting evidence, decommissioning systems and their inherent risks, represented in A-O. 12
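The claim that run rate costs double quickly once volume grows by more than 30% a year follows from compound growth. The sketch below is a rough illustration only: it assumes storage cost scales directly with volume and that the growth rate stays constant, neither of which the slide states, and the function names are made up for this example.

```python
import math


def doubling_time_years(annual_growth: float) -> float:
    """Years for a quantity growing at a constant annual rate to double."""
    return math.log(2) / math.log(1 + annual_growth)


def projected_run_rate(current: float, annual_growth: float, years: int) -> float:
    """Run rate after compounding growth, assuming cost scales with volume."""
    return current * (1 + annual_growth) ** years


print(round(doubling_time_years(0.30), 2))         # years to double at 30% growth
print(round(projected_run_rate(1.0, 0.30, 3), 2))  # multiple of today's run rate after 3 years
```

Under these assumptions, 30% annual growth doubles the run rate in well under three years, which is shorter than most budget planning horizons.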
Principle #2: Increased Spending Consumes IT Budget
* Most IT budgets are >80% committed to existing projects and are flat or declining.
* Budgets can't contain rising "keep everything" costs and still provide for strategic IT investments.
* Eventually "keep everything" models consume the remaining IT budget dollars.
[Chart: share of IT budget (0% to 100%) by year, 2010 to 2014. Series: Already Committed to Existing Projects; Available for Strategic Investments. Annotation: Failure scenario.]
Principle #3: Routine Disposal has Positive Impact on IT Budget
* Immediate reduction in supporting storage and infrastructure needs and costs from content decommissioning.
* Ongoing disposal ensures controlled information growth and preservation of the IT budget.
* Enables strategic IT investments to be made as needed.
* Stakeholders must agree on defensible processes for decommissioning and disposing of information or face the failure scenario.
[Chart: share of IT budget (0% to 100%) by year, 2010 to 2014. Series: Already Committed to Existing Projects; Available for Strategic Investments. Annotation: Impact of Decommissioning and Routine Disposal.]
The Form of Current Practices Intensifies the Challenge Disconnected silos are the problem and the source of high cost and risk. Describes holds by custodians involved; communicates hold to custodians rather than IT. Generally focused on email and files for its holds efforts. Relies on IT to keep everything, unconcerned about IT cost but struggles with cost of eDiscovery on so much data. DUTY Matter 500-15,000 Hold 30,000 300,000 VALUE Department 8,000 ASSET Retention schedule doesn't reflect their need for information, so they ignore it but may revolt if automated. Fighting to drive profit up and back office costs down. Angry about charge back costs, want better system performance and more from their data 5,000 12,000 Retention Schedule 300-800 DUTY Laws & Regs 100-page record schedule on intranet organized by class; relies upon volunteer effort to apply the schedule to electronic information. May have emphasis on retaining and regulatory compliance for 5-10% of enterprise information rather than enabling systematic deletion of unnecessary data. Has petabytes of data but no idea what is needed or why; has to assume it is all valuable. Organizes data by system and server names. Paying full cost of compliance while struggling to reconcile doubling data with shrinking budget. Systems 2,000-8,000 Information 3PBs 100PBs Billion choices for IT to triangulate laws, lawsuits, business value with data
Stakeholder Alignment Yields Best Practice Benefits
[Diagram labels: DUTY, VALUE, DUTY, Matter, Department, Laws & Regs, Hold, Systems, ASSET, Retention Schedule, Information]
LEGAL: Modernize eDiscovery Process
* Precise, reliable legal holds
* Assess evidence in place, collect less
* Lower legal risk, cost
BUSINESS: State Information Value
* Guidance on information utility
* Participate in volume reduction
* Align around value
IT: Optimize Information Volume
* Dispose and retire unnecessary data
* Optimize storage based on value
* Lower information cost
RECORDS: Modernize Retention Process
* Address electronic information
* Executable schedules can be automated
* Lower legal risk, cost
Results: Lowers Operational Cost and Risk Curbs storage growth, lowers run rate permanently Program leadership, process improvement and technology from IBM Storage Direct Procurement Costs Run rate reduction and growth avoidance Run rate Information Lifecycle Governance Program Executive charter for enterprise initiative Processes, capabilities and accountability to achieve cost and risk reduction benefits through Process improvements, expertise and technology: Value-Based Archiving & Defensible Disposal Archive to shrink storage, align cost to value Dispose rather than store unnecessary data Estimated Risk or Mitigation Burden Reduction Extend and automate retention management Include electronic data that has business value in addition to records for regulatory requirements Automate retention schedules across all information to enable reliable, systematic disposal. Automate the legal holds and ediscovery process Structure and automate legal holds process to lower risk, increase precision, enable disposal Analyze in place to reduce unnecessary collection, processing and review 17
Agenda Obstacles to Managing the Information that Matters Best Practice Readiness Leveraging Content Analytics for Records and Info Management Summary 18
Information Has a Lifespan Requiring Disposition Frequency of Access and Use 95% Expires 90% Born Digital Almost all has a retention policy very little should be kept forever Almost all is born digital and the rest should become digital Time 19
Begin with a shared system Policy and Process Integration Across Information Stakeholders Enables Disposal, Lowers Cost and Risk Strategy and Execution Drive Business Outcomes with Structure, Defined Processes, Metrics, Capacity & Accountability STRATEGY EXECUTION Governance Program Driving Savings and Risk Metrics Charter, directive and accountability for enterprise program. Savings achievement cadence and reporting. Program Office to Coordinate Stakeholders, Drive Benefit Achievement Ensures cross-silo engagement and progress toward maturity targets and financial objectives, change management Technology Provides Capacity to Improve and Integrate Processes, Consistently and Defensibly Dispose, Decommission Automates processes, ensures transparency, provides capacity. Accelerated deployment to drive faster save. Reclamation Removes Excess Storage, Infrastructure Savings-prioritized reclamation and recovery of infrastructure to drive P&L benefit >$300M enterprise value created over 3 years with lower legal and IT costs, reduced risk 20
Process Capabilities & Requirements PROCESS TRANSPARENCY Unified Governance Transparency across stakeholder processes Common governance data model and enterprise map Linkage of duties, value to information assets and business processes Governance analytics CREATE, USE Optimal accessibility Communicate value and duration Tap governance liaisons Access valuable information more easily Analytics on volume/cost of information HOLD, DISCOVER Rigorous Discovery Robust, affirmative legal holds for people, records, and data Preserve in place automation where disposition occurs Efficient data analysis and collection Legal cost and risk analytics RETAIN, ARCHIVE Value-Based Taxonomy and regulatory requirements Business value inventory Reliable, executable retention schedules for records and information of value Archive during period of value only Information cost and risk analytics STORE, SECURE Efficient Storage Store and optimize by value Meet SLAs for structured and unstructured information access ILG execution capability and enablement (holds, retention, disposal, collection) for data Data hygiene and governance DISPOSE Defensible Disposal Catalog of information value and duty by asset Legacy data clean up, application retirement Procedures and capabilities for disposal by source Risk and cost dashboard for information portfolio
Best Practices for the Information Lifecycle
Optimize business activities to:
* Automatically record and preserve evidence of transactions, events and processes
* Reduce the high costs of eDiscovery
* Enforce records retention policies
* Reduce infrastructure costs
1. Perform periodic assessments to determine what information should be kept
2. Archive, collect and classify both data and content to decommission systems while preserving access to the information
3. Declare and control official business records
4. Respond to eDiscovery requests more efficiently
5. Routinely dispose of information defensibly
6. Audit and govern the entire process while optimizing the underlying storage systems and infrastructure based on the value of information and associated legal duty
GARP
* Accountability: executive owner, delegated program responsibilities, documented program policies and procedures
* Integrity: guarantee of authenticity and reliability
* Protection: protect records & information that are private, confidential, privileged, secret, or essential to business continuity
* Compliance: compliance with applicable laws, regulations and policies
* Availability: ensure timely, efficient, and accurate retrieval of needed information
* Retention: maintain records as dictated by legal, regulatory, fiscal, operational, and historical requirements
* Disposition: secure and appropriate disposition for records that are no longer required
* Transparency: documented recordkeeping program
Source: ARMA International GARP Maturity Model
GARP Includes an Information Governance Maturity Model to help organizations: Evaluate recordkeeping programs and practices Identify gaps between current practices and the desirable level of maturity for each principle Assess the risk(s) to the organization, based on the biggest gaps Source: ARMA International GARP Maturity Model 24
Agenda Obstacles to Managing the Information that Matters Best Practice Readiness Leveraging Content Analytics for Records and Info Management Summary 25
Automation Best Practices for RIM
Technologies for Identifying & Managing Information
* LEVEL 2 (In Development), User-Driven: Users make collection and records categorization decisions by reviewing each content item.
* LEVEL 3 (Essential), Automated Collection & Declaration: Administrative users build automated collection and records declaration policies that use rule-based metadata.
* LEVEL 4 (Proactive), Advanced Classification: policies that use rule-based metadata and advanced contextual classification.
* LEVEL 5 (Transformational), Content Analytics: policies that use rule-based metadata, advanced contextual classification, and advanced content analytics.
Level 2: User-driven Collection & Declaration
LEVEL 2 (In Development)
Users decide and control how content is declared. Highly subjective and inaccurate. Assumes all users understand records policies.
National Archives and Records Administration (NARA) Electronic Records Management initiative focused on user-driven records declaration:
* 6+ month study
* Significant user drop-off after training period
* End users frequently outright refuse to categorize content
* Silos full of existing content abound, resulting in large backlogs in addition to new content
* Manual declaration and an emphasis on user training are an outdated practice
[Chart: NARA User Participation, horizontal axis 1 to 7.]
Level 3: Automated Collection & Declaration Automated Collection & Declaration Administrative users build automated collection and records declaration policies that use rule-based metadata Assess, monitor, identify, and collect information from all locations to facilitate RIM activities Legal preservation holds Records identification and declaration Information archiving Rule-based policies examine sources across enterprise silos, identify relevant information, and collect into a consolidated managed repository for RIM policy enforcement LEVEL 3 (Essential) 28
Level 4: Advanced Classification
LEVEL 4 (Proactive)
Policies that use rule-based metadata and advanced contextual classification.
Faster, More Accurate Collection Automation
* Increases collection accuracy with intelligent policy decisions based on both metadata and content context, without burdening users
Flexible Automation
* Rapidly trains via a learn-by-example approach, with flexible automation levels to accelerate adoption and acceptance
* Incorporates user feedback in real time to improve understanding
* Auditable logic documents classification decisions for improved defensibility
Decision Plans Layer Multiple Methods for Records Classification
Decision Plans combine approaches to classification. Context-based classification delivers high accuracy; rules-based classification addresses hard-and-fast requirements. Combining methods delivers the best results.
[Chart: Consistency, Accuracy, Consistent Participation & Enforcement (Low to High) against Cost Savings, Productivity (Low to High). Labels: Manual Classification, Rules Based Classification, Context Based Classification, Multiple Methods, Ask, Inspect, Imply.]
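A decision plan of the kind described above can be sketched as a chain: apply hard-and-fast rules first, fall back to context-based classification, and route anything still uncertain to a person. This is a minimal illustration, not the IBM Classification Module's API. The rule table, category vocabularies, threshold and all names below are hypothetical, and the keyword-overlap scorer merely stands in for a trained classifier.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple


@dataclass
class Document:
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


# Hypothetical hard-and-fast rules keyed on metadata that already exists.
METADATA_RULES = {"invoice": "Financial Record", "offer_letter": "HR Record"}

# Hypothetical vocabularies standing in for a trained context-based classifier.
CATEGORY_TERMS: Dict[str, Set[str]] = {
    "Contract": {"agreement", "party", "term", "signed"},
    "HR Record": {"employee", "salary", "leave", "performance"},
}


def rules_based(doc: Document) -> Optional[str]:
    """Return a category when a metadata rule fires, otherwise None."""
    return METADATA_RULES.get(doc.metadata.get("doc_type", ""))


def context_based(doc: Document) -> Tuple[str, float]:
    """Score each category by the share of its terms found in the text."""
    words = set(re.findall(r"[a-z]+", doc.text.lower()))
    scores = {cat: len(terms & words) / len(terms)
              for cat, terms in CATEGORY_TERMS.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


def decision_plan(doc: Document, threshold: float = 0.75) -> Tuple[str, str]:
    """Return (category, method used), trying the cheapest method first."""
    label = rules_based(doc)
    if label is not None:
        return label, "rules"
    label, confidence = context_based(doc)
    if confidence >= threshold:
        return label, "context"
    return "Unclassified", "manual review"
```

Returning the method alongside the category keeps an audit trail of how each decision was reached, and the threshold controls how much work is pushed to manual review.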
Rule Systems: the Effect of Real-Time Learning
Use rule systems to act on existing metadata or keywords available in the process, content system or document properties.
[Chart: Manual Classification, Rules Based Classification, Context Based Classification and Multiple Methods, plotted Low to High on both axes.]
Context Based Classification
Use context based classification to inspect the document when there is not enough metadata already available. Simple rules or keyword based analysis can be too coarse to make fine distinctions between long-form texts with very different intent.
[Chart: Manual Classification, Rules Based Classification, Context Based Classification and Multiple Methods, plotted Low to High on both axes.]
High Critical dimensions of classification Manual Classification Rules Based Classification Multiple Methods Context Based Classification Use manual classification for high value documents or when other methods do not provide enough information. Consider the volumes of information. Low Low High Manual Automated Accuracy Cost (per doc) 92% X 60 90% 46% $ 0.17 < $ 0.01 Consistency <50% 100% Increasing volume and variety of information magnifies the challenges of consistency and cost burdens 33
Classification at the US Army
Challenge:
* Government Accountability Office (GAO) report: 4 federal agencies surveyed revealed NARA regulation non-compliance, specifically with email. "Factors contributing to noncompliance included insufficient training and oversight as well as the difficulties of managing large volumes of e-mail." [1]
* Training 1.2 million users:
  * Logistical impossibility, given the scale of the organization
  * Poorly aligned to users' skills and an inefficient use of their time
Solution: Utilize IBM Classification Module in IBM's email archiving and records management solution to automate record categorization without burdening users.
Benefits:
* 85% automation after Phase 1
* 99% automation after Phase 2
* Each phase tested on approximately 600,000 email messages (different corpus each phase)
ROI Projections:
* 900 TB of disk savings, annually
* $1.8 M in hardware savings alone, independent of human costs and consistency of classification
* Very high satisfaction with each pass when reviewed manually by a Records Manager for accuracy
"As a records manager with a 25-year background in federal and civilian records management, I believe the automatic categorization of information is the next logical evolution in managing the records of an organization." [2] Brenda Fletcher, records manager, United States Army
[1] GAO Report: Federal Records: Agencies Face Challenges in Managing E-Mail, Apr 2008
[2] IBM Case Study: Achieving compliance and controlling costs with automated categorization of e-mail for records management. A look at best practices and a U.S. Army ROI use case.
Level 5: Content Analytics
LEVEL 5 (Transformational)
Policies that use rule-based metadata, advanced contextual classification, and advanced content analytics.
* The only way to effectively deal with massive amounts of records and information
* Search alone has proven to fail
[Diagram labels: Bloated Production Systems with Inefficient Storage, Content In The Wild, Unnecessary Information, Necessary Information]
Traditional approaches are converging
[Diagram labels: Enterprise Search, Business Intelligence, Text Analytics, Content Analytics]
* More than keyword search is needed. "Making unstructured data searchable is now a presumed primary interface for applications of all kinds, as well as for intranets and content repositories." Whit Andrews, Rita Knox, Gartner
* Analyzing unstructured content no longer optional. "For many business process professionals, access to structured data, even when supported by BI or predictive analytics, lacks sufficient context for customer service, finance, and other areas where communications with customers involves many channels." Craig Le Clair, Forrester
* Increasing in business importance. "Early adopters of [text analytics] are already gaining a competitive advantage. Organizations that fail to do so will be at risk." Sue Feldman, IDC
* Converging toward content analytics. "Every enterprise should understand how content analytics can produce answers to its critical questions; understanding this now will make it possible to exploit these tools as their availability proliferates." Rita Knox, Gartner
Content Analytics can enable content archival, expiration and disposal Bloated Production Systems with Inefficient Storage Content In The Wild Unnecessary Information Necessary Information Content Analytics helps you gain control by eliminating unneeded content and content systems while preserving valued content One customer found 1200 copies of the same policy document, including 5 different versions, distributed across enterprise file servers 37
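Finding 1200 copies of one document is, at its simplest, a matter of grouping files by a content fingerprint. The sketch below shows that one idea with SHA-256 from Python's standard library. It works on an in-memory mapping of path to bytes so it can run anywhere, the paths and contents are invented, and it only catches byte-identical copies. Spotting the 5 different versions mentioned above would need near-duplicate techniques that are not shown here.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def find_exact_duplicates(files: Dict[str, bytes]) -> List[List[str]]:
    """Group paths whose contents are byte-for-byte identical."""
    by_digest = defaultdict(list)
    for path, content in files.items():
        by_digest[hashlib.sha256(content).hexdigest()].append(path)
    # Keep only groups that contain more than one copy.
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]


# Invented sample data: two identical copies and one revised version.
files = {
    "//fs1/hr/policy.doc": b"Travel policy v1",
    "//fs2/sales/policy_copy.doc": b"Travel policy v1",
    "//fs3/legal/policy_v2.doc": b"Travel policy v2",
}
print(find_exact_duplicates(files))
```

Each group returned is a candidate for keeping one managed copy and disposing of the rest, subject to holds and retention policy.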
Organizations need text analysis and natural language processing to effectively deal with large volumes of records
* Over 80% of information being stored is unstructured
* Text analytics unlocks the power of that information for a variety of functions and applications
[Diagram labels: Data, Content]
What is Text Analytics?
Text Analytics (NLP*) describes a set of linguistic, statistical, and machine learning techniques that allow text to be analyzed and key information extracted for business integration.
* NLP = Natural Language Processing
Going from raw information to insightful information using natural language processing and analytics
Uncover the value of records and information through a visual-based approach
* Aggregate and extract from multiple sources to form large text-based collections from multiple internal and external sources (and types), including ECM repositories, structured data, social media and more.
* Organize, analyze and visualize enterprise content (and data) by identifying trends, patterns, correlations, anomalies and business context from collections.
* Search and explore to derive insight from collections to confirm what is suspected or uncover something new without being forced to build models or deploy complex systems.
Content Analytics Explained Claimant: Soft Tissue Injury Extracted Concept Analyzed Content (and Data) Person Injury Body Part Location Noun Verb Noun Phrase Prep Phrase John sprained his ankle on the step... Source Information Internal (ECM, Files, DBMS, etc.) and External (Social, News, etc.) Automatic Visualization for Interactive Exploration and Assessment 40
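The annotation flow on this slide, from words to typed entities to an extracted concept, can be illustrated with a toy dictionary-based annotator. Real content analytics relies on linguistic parsing and statistical models; the lexicon, the list of soft-tissue verbs and the function names below are all assumptions made for this sketch.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical lexicon mapping lower-cased words to entity types.
LEXICON = {
    "john": "Person",
    "sprained": "Injury",
    "ankle": "Body Part",
    "step": "Location",
}

# Hypothetical list of verbs treated as soft-tissue injuries.
SOFT_TISSUE_INJURIES = {"sprained", "strained", "bruised"}


def annotate(sentence: str) -> List[Tuple[str, str]]:
    """Tag each known word in the sentence with its entity type."""
    tokens = re.findall(r"[A-Za-z]+", sentence)
    return [(tok, LEXICON[tok.lower()]) for tok in tokens if tok.lower() in LEXICON]


def extract_concept(annotations: List[Tuple[str, str]]) -> Optional[str]:
    """Derive a higher-level concept from the combination of annotations."""
    has_person = any(label == "Person" for _, label in annotations)
    soft_tissue = any(label == "Injury" and tok.lower() in SOFT_TISSUE_INJURIES
                      for tok, label in annotations)
    if has_person and soft_tissue:
        return "Claimant: Soft Tissue Injury"
    return None


annotations = annotate("John sprained his ankle on the step...")
print(annotations)
print(extract_concept(annotations))
```

The point of the two-stage design is that the concept is never stated in the source text; it is inferred from the combination of lower-level annotations.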
Content Analytics Enables Interactive Exploration
Metadata
* File type
* File size
* File location
* Creation date
* Author
Content
* Topic of document
* Purpose of document
* Organizations mentioned
* Individuals mentioned
* Concepts mentioned
Questions to explore:
* Are particular file types or locations correlated to particular languages?
* When did documents on a particular topic begin appearing in our file systems?
* Why is there content mentioning a particular organization from 12+ years ago?
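Questions like the ones above reduce to counting and filtering over facets once metadata and content annotations sit side by side. A minimal sketch with invented records and hypothetical field names follows; a real deployment would run the same operations over an index rather than a Python list.

```python
from collections import Counter
from typing import Dict, List, Optional

# Invented sample records combining metadata facets and content facets.
DOCS: List[Dict[str, object]] = [
    {"file_type": "pdf", "language": "en", "topic": "safety", "created": 2004},
    {"file_type": "pdf", "language": "de", "topic": "pricing", "created": 2009},
    {"file_type": "doc", "language": "en", "topic": "safety", "created": 1998},
    {"file_type": "doc", "language": "en", "topic": "pricing", "created": 2010},
]


def facet_counts(docs: List[Dict[str, object]], facet: str) -> Counter:
    """Count how many documents carry each value of a facet."""
    return Counter(doc[facet] for doc in docs)


def filter_docs(docs: List[Dict[str, object]], **criteria: object) -> List[Dict[str, object]]:
    """Keep only the documents matching every facet=value criterion."""
    return [d for d in docs if all(d[k] == v for k, v in criteria.items())]


def first_appearance(docs: List[Dict[str, object]], topic: str) -> Optional[int]:
    """Earliest creation year for a topic, or None if the topic never appears."""
    years = [d["created"] for d in docs if d["topic"] == topic]
    return min(years) if years else None


# Is file type correlated with language? Drill into PDFs and count languages.
print(facet_counts(filter_docs(DOCS, file_type="pdf"), "language"))
# When did documents on a topic begin appearing?
print(first_appearance(DOCS, "safety"))
```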
Extend the same concepts to ediscovery Quickly get a view of the people, sender and recipient domains, and companies involved. Combine facets and filters to quickly include and eliminate custodians and data such as people from certain locations or other combination. Automatically extracted phrases in the content show the essence of the information. Organize a topographical view by key category. The peaks show frequency and phrases to quickly identify relevant information. 42
How to Decommission Unnecessary Content
1. Identify Content Sources to be assessed
2. IT Initial Assessment to decommission IT-irrelevant content (duplicates, machine-generated emails, etc.)
3. LOB and RIM Specific Assessments to decommission over-retained and obsolete content and to collect and classify valued and obligated content (requires knowledge of content value)
4. System & Application Decommissioning by IT
5. Periodic audits by IT, RIM and LOB keep content environments optimized
[Diagram: steps 1 to 4 in sequence with Content Collection, and step 5 (Periodic Audit) closing the cycle.]
Benefits of Content Analytics Before Overflowing file systems burying valuable content Storage is 17% of IT budget Dormant and orphaned content on high-cost storage, necessitating software maintenance Knowledge workers searching for info 10 hrs/week Over-retained information stored beyond disposition After Up to 80% reduction in content storage costs Corresponding storage administration costs cut by 40-60% Elimination of up to $150K per retired software application Up to 20-40% of searching time eliminated Lower risk 44
Level 5+: A Look Towards the Future Deep Question Answering Applying natural language question answering technology to RIM Next LEVEL (Beyond Transformational) 45 45
Truly understanding natural language is the next great computing challenge
* Over 80% of information today is unstructured and based on natural language.
* The impact of Systems of Engagement both inside and outside the firewall is dramatic: such masses of information are not easily understandable by humans.
* Legacy approaches have all failed; searching is not the right approach.
* A new approach is needed, leveraging content analysis and natural language processing.
The Next Grand Challenge 47
Real language is real hard
Chess
* A finite, mathematically well-defined search space
* Limited number of moves and states
* Grounded in explicit, unambiguous mathematical rules
Human Language
* Ambiguous, contextual and implicit
* Contains slang, riddles, idioms, abbreviations, acronyms and more
* Grounded only in human cognition
* Seemingly infinite number of ways to express the same concepts and meaning
Unstructured vs Structured
The hard part: understanding natural language with confidence and accuracy
* Where was Einstein born? "One day, from among his city views of Ulm, Otto chose a watercolor to send to Albert Einstein as a remembrance of Einstein's birthplace."
* Welch ran this? "If leadership is an art then surely Jack Welch has proved himself a master painter during his tenure at GE."
[Diagram labels: Unstructured, Structured]
The Jeopardy! Challenge
5 key dimensions to drive the technology
* Broad/open domain
* Complex language
* High precision
* Accurate confidence
* High speed
Sample clues:
* $200: If you're standing, it's the direction you should look to check out the wainscoting
* $800: In cell division, mitosis splits the nucleus & cytokinesis splits this liquid cushioning the nucleus
* $1000: Of the 4 countries in the world that the U.S. does not have diplomatic relations with, the one that's farthest north
Examples from Jeopardy! clues and missing links
* Clue: This fish was thought to be extinct millions of years ago until one was found off South Africa in 1938. Category: ENDS IN "TH". Answer: coelacanth
* Clue: When hit by electrons, a phosphor gives off electromagnetic energy in this form. Category: General Science. Answer: light (or photons)
* Clue: Secy. Chase just submitted this to me for the third time--guess what, pal. This time I'm accepting it. Category: Lincoln Blogs. Answer: his resignation
The Jeopardy! winner's cloud
* Each dot represents an actual human Jeopardy! game
* Top human players are remarkably good
[Chart labels: Best human performance, Winning Human Performance, Grand Champion Human Performance, Past computer results, 2007 QA Computer System, More Confident, Less Confident]
The technology behind IBM Watson
How it Really Works with Content
1. Question
2. Question & Topic Analysis (multiple natural language interpretations)
3. Question Decomposition
4. Hypothesis Generation: Primary Search and Candidate Answer Generation over Answer Sources (100's of sources)
5. Hypothesis and Evidence Scoring: Answer Scoring, Evidence Retrieval and Deep Evidence Scoring over Evidence Sources (1000's of pieces of evidence; 100,000's of scores from many deep analysis algorithms)
6. Synthesis (balance, weigh and combine)
7. Final Confidence Merging & Ranking (learned models help combine and weigh the evidence)
8. Answer with Confidence
Isn't this just like search?
Question: What happens if my shoelaces become untied?
Search-only results:
* Based on keyword popularity and search engine optimized
* Lots of shopping suggestions
* Results prove it didn't understand the question
* Can include profanity
Note: This is mocked up from two separate search query approaches
Evidence Profiles summarize evidence analysis across many sources Clue: Chile shares its longest land border with this country. Bolivia is more popular due to a commonly discussed border dispute but Argentina has more reliable sources Correct Answer: Argentina 55
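The idea that learned models weigh different kinds of evidence before ranking can be sketched with a small logistic combination applied to the clue above. Every number here is invented for illustration: the feature values, the weights and the bias are assumptions, not Watson's actual features or trained parameters, and a real system learns the weights from many question and answer pairs.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical learned weights: reliable sourcing counts for more than popularity.
WEIGHTS = {"popularity": 0.5, "source_reliability": 2.0}
BIAS = -1.0

# Invented evidence scores for the two candidate answers.
CANDIDATES: Dict[str, Dict[str, float]] = {
    "Bolivia": {"popularity": 0.9, "source_reliability": 0.3},
    "Argentina": {"popularity": 0.6, "source_reliability": 0.9},
}


def confidence(features: Dict[str, float]) -> float:
    """Combine evidence scores into one confidence value between 0 and 1."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def rank(candidates: Dict[str, Dict[str, float]]) -> List[Tuple[str, float]]:
    """Order candidate answers from most to least confident."""
    scored = [(answer, confidence(feats)) for answer, feats in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for answer, conf in rank(CANDIDATES):
    print(f"{answer}: {conf:.2f}")
```

With these made-up weights the more popular candidate loses to the better-sourced one, which is the behavior the evidence profile is meant to explain.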
Different Types of Evidence: Keyword Evidence
Clue: In May 1898 Portugal celebrated the 400th anniversary of this explorer's arrival in India.
Passage: In May, Gary arrived in India after he celebrated his anniversary in Portugal.
Keyword matching between clue and passage:
* "In May 1898" / "In May"
* "celebrated" / "celebrated"
* "400th anniversary" / "anniversary"
* "Portugal" / "in Portugal"
* "arrival in India" / "arrived in", "India"
* "explorer" / "Gary"
Evidence suggests "Gary" is the answer, BUT the system must learn that keyword matching may be weak relative to other types of evidence.
Different Types of Evidence: Deeper Evidence. Clue: In May 1898 Portugal celebrated the 400th anniversary of this explorer's arrival in India. Passage: On the 27th of May 1498, Vasco da Gama landed in Kappad Beach. Search far and wide, explore many hypotheses, find and judge evidence with many inference algorithms. Temporal reasoning (date math): "In May 1898" and "400th anniversary" correspond to "27th May 1498". Statistical paraphrasing (paraphrases): "arrival in" corresponds to "landed in". Geospatial reasoning (geo-KB): "India" corresponds to "Kappad Beach". "explorer" corresponds to "Vasco da Gama". Stronger evidence can be much harder to find and score.
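The date math named on this slide can be illustrated with a short Python sketch. It is a toy version of temporal reasoning under stated assumptions: the function names are hypothetical, years are limited to 1000 to 2099, and the only rule is "anniversary year minus ordinal equals event year". A real system would combine many such scorers.

```python
import re


def years_in(text):
    """Return every four-digit year (1000-2099) mentioned in the text."""
    return [int(y) for y in re.findall(r"\b(1\d{3}|20\d{2})\b", text)]


def anniversary_ordinal(text):
    """Return N for a phrase like '400th anniversary', or None if absent."""
    m = re.search(r"\b(\d+)(?:st|nd|rd|th) anniversary\b", text)
    return int(m.group(1)) if m else None


def temporal_match(clue, passage):
    """True if the passage mentions the year implied by the clue's anniversary."""
    n = anniversary_ordinal(clue)
    clue_years = years_in(clue)
    if n is None or not clue_years:
        return False
    implied = {year - n for year in clue_years}  # 1898 - 400 = 1498
    return any(year in implied for year in years_in(passage))


clue = ("In May 1898 Portugal celebrated the 400th anniversary "
        "of this explorer's arrival in India.")
vasco = "On the 27th of May 1498, Vasco da Gama landed in Kappad Beach."
gary = ("In May, Gary arrived in India after he celebrated "
        "his anniversary in Portugal.")

print(temporal_match(clue, vasco))
print(temporal_match(clue, gary))
```

The Vasco da Gama passage satisfies the date constraint even though it shares almost no keywords with the clue, while the Gary passage contains no year at all and fails.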
IBM at 100: ECM Innovation for Over 50 Years, Beginning in 1957. Over $15B Invested Since 2006. Timeline labels (1957 to 2011): Searching and Classifying Content; Syndication; Imaging; Tarian Software; Workflow / BPM; Watson; Advanced Case Management; Content Analytics (TAKMI); Records Management / eDiscovery; Aptrix; Green Pasture; Digital Libraries; Video Content; ECM Standards; PureEdge; iPhrase; FileNet; Production Imaging; PSS Systems; Datacap; Venetica.
Agenda: Obstacles to Managing the Information that Matters; Best Practice Readiness; Leveraging Content Analytics for Records and Information Management; Summary, and How to Get Started.
RIM Benefits of Content Analytics: Smarter Decisions, Lower Costs, Reduced Risk, Increased Productivity. Invest smarter: develop the ROI case for information governance. Plan smarter: prioritize the riskiest areas for your focus. Cut storage costs up to 80% by eliminating the unnecessary. Reduce the administration burden for storage by 40-60%. Eliminate up to $150K associated with each system and application decommissioned. Lower risk via more consistent disposition. eDiscovery collection in hours vs. days. Cut eDiscovery review by 5-10%: fewer docs collected leads to lower eDiscovery review costs. Eliminate manual analysis and classification at 17 cents/doc. Slash the 10 hours per week knowledge workers spend searching by removing the irrelevant content. Rapid creation of rules, policies and models.
Best practice process for leveraging Content Analytics: Ensure stakeholder alignment. Develop or adopt a valuation model and test the model. Choose the right technology based on your requirements. Monitor, refine, audit and report on the results. Track and promote the ROI.
Next Steps: Validate The Potential Savings and How to Achieve Them
Summary. The day of reckoning is here: take action now. Humans are not the answer: the only way forward is with content analytics. Dynamically analyze to empower your information stakeholders to make decisions. Decommission what's unnecessary. Preserve and exploit the content that matters, and preserve your IT budget. [Chart: share of Unnecessary Information vs. Necessary Information, percentage axis, 2010 to 2014.]
IBM is a Unique, Strategic Partner in Enabling Defensible Disposal
Thank you