Natural Language Processing of UK Annual Report Narratives. Paul Rayson Mahmoud El-Haj

Similar documents
CLARIN (in the) UK Tools and Services

Computer Assisted Language Learning (CALL): Room for CompLing? Scott, Stella, Stacia

Corpus and Discourse. The Web As Corpus. Theory and Practice MARISTELLA GATTO LONDON NEW DELHI NEW YORK SYDNEY

INTRODUCTION TO DATA MINING SAS ENTERPRISE MINER

Automatic Text Analysis Using Drupal

DEFINING EFFECTIVENESS FOR BUSINESS AND COMPUTER ENGLISH ELECTRONIC RESOURCES

C o p yr i g ht 2015, S A S I nstitute Inc. A l l r i g hts r eser v ed. INTRODUCTION TO SAS TEXT MINER

Using Feedback Tags and Sentiment Analysis to Generate Sharable Learning Resources

AntConc: Design and Development of a Freeware Corpus Analysis Toolkit for the Technical Writing Classroom

Enterprise Content Management (ECM) Strategy

Technical Report. The KNIME Text Processing Feature:

Special Topics in Computer Science

Dedoose Distinguishing features and functions

Text Mining - Scope and Applications

XBRL guide for UK businesses

Supporting Law Enforcement in Digital Communities through Natural Language Analysis

TDPA: Trend Detection and Predictive Analytics

On-Site Search Engine Optimisation Tip Sheet Key Multimedia Ltd

PROMT Technologies for Translation and Big Data

Search and Information Retrieval

DATA EXPERTS MINE ANALYZE VISUALIZE. We accelerate research and transform data to help you create actionable insights

Semantic annotation of requirements for automatic UML class diagram generation

D3.1: SYSTEM TEST SUITE

Question template for interviews

Direct-to-Company Feedback Implementations

Find the signal in the noise

Augmented Search for Web Applications. New frontier in big log data analysis and application intelligence

STAR WARS AND THE ART OF DATA SCIENCE

MIS Privacy Statement. Our Privacy Commitments

The Knowledge Sharing Infrastructure KSI. Steven Krauwer

ACEDS Membership Benefits Training, Resources and Networking for the E-Discovery Community

Big Data uses cases and implementation pilots at the OECD

Exploration and Visualization of Post-Market Data

Statistics on Drug and Alcohol Treatment for Adults and Young People (produced by Public Health England)

Yandex: Webmaster Tools Overview and Guidelines

Digital Communication and Interoperability - A Case Study

USABILITY OF A FILIPINO LANGUAGE TOOLS WEBSITE

Terminology Extraction from Log Files

New Options for Mutual Fund Managers

Processing: current projects and research at the IXA Group

Joint efforts to further develop and incorporate Apertium into the document management flow at Universitat Oberta de Catalunya

LEVERAGING BIG DATA!

Logistics of setting up a company Dr Carl Firth ASLAN Pharmaceuticals

Regulatory News Service (RNS)

Unstructured Threat Intelligence Processing using NLP

Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast

Module Catalogue for the Bachelor Program in Computational Linguistics at the University of Heidelberg

Survey Results: Requirements and Use Cases for Linguistic Linked Data

The role of modern board. Chartered Institute of Management Accountants

PRODUCT DESCRIPTION OF SERVICES PROVIDED BY IPEER

TechWatch. Technology and Market Observation powered by SMILA

Augmented Search for IT Data Analytics. New frontier in big log data analysis and application intelligence

The Analysis of Online Communities using Interactive Content-based Social Networks

WHITEPAPER. Text Analytics Beginner s Guide

Binge drinking increases risk of dementia

The New ROI: Results Oriented Intel. David Amsler, Founder

A DTD for Qualitative Data:

CSR REPORT 2016 Corporate Social Responsibility Report

INTERNET MARKETING. SEO Course Syllabus Modules includes: COURSE BROCHURE

Life's cheap, but who's buying?

Computer Concepts And Applications CIS-107-TE. TECEP Test Description

Welcome to the webinar Does your department or company use the valuable data it collects to plan for future needs and trends?

Backbase Accessibility

An Open Platform for Collecting Domain Specific Web Pages and Extracting Information from Them

CS 6740 / INFO Ad-hoc IR. Graduate-level introduction to technologies for the computational treatment of information in humanlanguage

How long men live. MALE life expectancy at birth Newcastle compared to England and other Core Cities

Positive Versus Negative News: Readability and Obfuscation in Financial Reports

PREDICTING MARKET VOLATILITY FEDERAL RESERVE BOARD MEETING MINUTES FROM

GEDDM - Commercial Data Mining Using Distributed Resources

Adam Walsh Child Protection and Safety Act. Or SORNA Sex Offender Registration and Notification Act By Chris Phillis Maricopa Public Defender

Social Media Monitoring, Planning and Delivery

How to Drive More Traffic to Your Event Website

Web Developer's Status Report. Laurie Laforest, Ph.D. Academic Technology 22 May 2009

Workshop. Neil Barrett PhD, Jens Weber PhD, Vincent Thai MD. Engineering & Health Informa2on Science

VICTORIAN HEALTHCARE ASSOCIATION

Service Road Map for ANDS Core Infrastructure and Applications Programs

THE UNIVERSITY OF MANCHESTER PARTICULARS OF APPOINTMENT FACULTY OF MEDICAL AND HUMAN SCIENCES SCHOOL OF PSYCHOLOGICAL SCIENCES

APT How to Better Manage Risk by Outsourcing Risk Measurement. London 11 June 2014

Extraction of Legal Definitions from a Japanese Statutory Corpus Toward Construction of a Legal Term Ontology

National curriculum tests. Key stage 2. English reading test framework. National curriculum tests from For test developers

Developing a User-based Method of Web Register Classification

31 Case Studies: Java Natural Language Tools Available on the Web

GrammAds: Keyword and Ad Creative Generator for Online Advertising Campaigns

CORPORATE GOVERNANCE STATEMENT

Building a Question Classifier for a TREC-Style Question Answering System

The authors provide the frameworks, analysis tools and route-maps to understand and action creating a marketdriven

OPEN SOURCE SOFTWARE CUSTODIAN AS A SERVICE

A Framework-based Online Question Answering System. Oliver Scheuer, Dan Shen, Dietrich Klakow

Corporate Governance Statement

Build Vs. Buy For Text Mining

Current Themes in Social Marketing Research: Text-mining the past five years

Voluntary Product Accessibility Template Blackboard Learn Release 9.1 April 2014 (Published April 30, 2014)

Company Tax Returns: format for accounts forming part of an online return

NHS Procurement Dashboard: Overview

ACT Enrollment Planners Conference

Doing Multidisciplinary Research in Data Science

CDA for Common Document Types: Objectives, Status, and Relationship to Computer-assisted Coding

Incorporating ABI, NAPF and FRC feedback into your AGM and reporting

Product to Customer. through MDM. Presented by Luminita Vollmer, MBA, CDMP, CBIP

Transcription:

Natural Language Processing of UK Annual Report Narratives Paul Rayson Mahmoud El-Haj

Session outline What is NLP? How we have applied NLP methods to the financial domain in the CFIE project Demo of our systems Wmatrix-import Wmatrix

Natural Language Processing getting computers to understand language a branch of computer science dealing with analysing, understanding and generating natural language

NLP and Corpus Linguistics @ Lancaster Analyzing digital personas behind criminal tactics to provide investigators with clues to the identity of the individual or group hiding behind one or more personas Child sex offenders masquerading as young people to gain their victims trust Radicalization of youth in online forums through persuasive messaging Online dating sites that gain trust and the exploit users for financial gain Analysis of cyber bullying and online trolling Use textual analysis to predict a persona s key attributes (e.g., age and gender) even where the author is obfuscating the language Assist in early diagnosis of dementia and Alzheimer s disease Longitudinal changes in linguistic features (e.g., grammatical complexity and idea density) are associated with degenerative memory conditions

Corporate Financial Information Environment (CFIE) ESRC- and ICAEW-funded project Martin Walker, Manchester Business School Steven Young, Lancaster University Management School Paul Rayson, Lancaster University School of Computing & Communications Mahmoud El-Haj, Lancaster University School of Computing & Communications Vasiliki Athanasakou, London School of Economics Thomas Schleicher, Manchester Business School Project seeks to analyse UK financial narratives, their association with financial statement information, and their informativeness for investors Automated, large sample analysis of UK annual report narratives represents a cornerstone of the project Develop software for general use by academics http://ucrel.lancs.ac.uk/cfie/

Analysing UK Annual Report Narratives Harvest reports Depends on file type (*.pdf, *.rtf, *.txt, etc.) Clean and parse text Analyse text Word counts based on wordlists Measuring constructs using offthe-shelf software Collect documents manually or use crawler to automate collection e.g., use markup tags to parse sections, remove exhibits and pictures; clean prior to analysis, etc. Machine learning (training software to recognise patterns) Text mining to identify patterns

US Filings: 10-K Annual Form Each 10-K contains 4 parts and 15 items PART I ITEM 1. Description of Business ITEM 2. Description of Properties ITEM 3. Legal Proceedings ITEM 4. Mine Safety Disclosures PART II ITEM 5. Market for Registrant s Common Equity. ITEM 6. Selected Financial Data ITEM 7. Management s Discussion and Analysis. ITEM 8. Financial Statements and Supplementary Data ITEM 9. Changes in and Disagreements. PART III ITEM 10. Directors, Executive Officers and Corporate Governance ITEM 11. Executive Compensation ITEM 12. Security Ownership of Certain Beneficial Owners. ITEM 13. Certain Relationships and Related Transactions. ITEM 14. Principal Accounting Fees and Services PART IV ITEM 15. Exhibits, Financial Statement Schedules.

c.f. UK Annual Reports Free Style (no standard structure) Use of Images, Text PDF Format Content and structure varies across firms. Management have more discretion over what, where, and how much information on topics such as risk, strategy, performance, etc. is reported. This makes the extraction and analysis task more challenging; but it provides research opportunities.

Use contents page to extract text by section from digital pdf

Steps in extraction process: Detect contents page Parse contents page Detect page numbering to determine section start/end Add headers as bookmarks to pdf Extract text for each section Analyse extracted text by section and for entire document

In addition to performing text extraction, the tool provides a range of text analysis options: Readability metrics Word counts using pre-determined lists (e.g., forward looking, uncertainty, tone, etc.) Word counts based on user-defined wordlists Comparison with reference corpus (word, parts of speech and semantic level) Concordance and collocates Upload and analyse user-defined text file

Overview (wordlists, readability metrics) and interface with Wmatrix Uploading one or more annual reports and generating output Uploading and analysing with a user-defined key word list Uploading and analysing a user-defined text file Examples of further analysis in Wmatrix: Cloud for chairman s statement vs. standard reference corpus: word level Cloud for chairman s statement vs. to standard reference corpus: semantic level Cloud for chairman s statement vs. chairman s statement corpus: word level Cloud for chairman s statement vs. chairman s statement corpus: semantic level Concordance/collocation

Case study: Strategic content Measure strategic content of annual reports by automatic analysis of words and phrases in financial commentary and business strategy sections Prior UK studies: manual, 300 reports 60,000 sections in 10,000 reports 1156 key words and phrases Clear evidence of doubling of strategic content since 2002

Future developments Choice of different readability metrics Text re-use metric to detect boilerplating and incremental changes Reference corpora for: Full Annual Report and Accounts Chairman s Statement CEO Review/Business Review/Operating Review/Financial Review Highlights Corporate Governance Directors Remuneration Readability scores for all UK annual reports for the period 2003-2013 Further methods for representing key themes (e.g., strategy, risk, CSR and governance, etc.) by section