Natural Language Processing of UK Annual Report Narratives. Paul Rayson Mahmoud El-Haj
|
|
|
- Osborne McKenzie
- 9 years ago
- Views:
Transcription
1 Natural Language Processing of UK Annual Report Narratives Paul Rayson Mahmoud El-Haj
2 Session outline What is NLP? How we have applied NLP methods to the financial domain in the CFIE project Demo of our systems Wmatrix-import Wmatrix
3 Natural Language Processing getting computers to understand language a branch of computer science dealing with analysing, understanding and generating natural language
4 NLP and Corpus Lancaster Analyzing digital personas behind criminal tactics to provide investigators with clues to the identity of the individual or group hiding behind one or more personas Child sex offenders masquerading as young people to gain their victims trust Radicalization of youth in online forums through persuasive messaging Online dating sites that gain trust and the exploit users for financial gain Analysis of cyber bullying and online trolling Use textual analysis to predict a persona s key attributes (e.g., age and gender) even where the author is obfuscating the language Assist in early diagnosis of dementia and Alzheimer s disease Longitudinal changes in linguistic features (e.g., grammatical complexity and idea density) are associated with degenerative memory conditions
5 Corporate Financial Information Environment (CFIE) ESRC- and ICAEW-funded project Martin Walker, Manchester Business School Steven Young, Lancaster University Management School Paul Rayson, Lancaster University School of Computing & Communications Mahmoud El-Haj, Lancaster University School of Computing & Communications Vasiliki Athanasakou, London School of Economics Thomas Schleicher, Manchester Business School Project seeks to analyse UK financial narratives, their association with financial statement information, and their informativeness for investors Automated, large sample analysis of UK annual report narratives represents a cornerstone of the project Develop software for general use by academics
6 Analysing UK Annual Report Narratives Harvest reports Depends on file type (*.pdf, *.rtf, *.txt, etc.) Clean and parse text Analyse text Word counts based on wordlists Measuring constructs using offthe-shelf software Collect documents manually or use crawler to automate collection e.g., use markup tags to parse sections, remove exhibits and pictures; clean prior to analysis, etc. Machine learning (training software to recognise patterns) Text mining to identify patterns
7 US Filings: 10-K Annual Form Each 10-K contains 4 parts and 15 items PART I ITEM 1. Description of Business ITEM 2. Description of Properties ITEM 3. Legal Proceedings ITEM 4. Mine Safety Disclosures PART II ITEM 5. Market for Registrant s Common Equity. ITEM 6. Selected Financial Data ITEM 7. Management s Discussion and Analysis. ITEM 8. Financial Statements and Supplementary Data ITEM 9. Changes in and Disagreements. PART III ITEM 10. Directors, Executive Officers and Corporate Governance ITEM 11. Executive Compensation ITEM 12. Security Ownership of Certain Beneficial Owners. ITEM 13. Certain Relationships and Related Transactions. ITEM 14. Principal Accounting Fees and Services PART IV ITEM 15. Exhibits, Financial Statement Schedules.
8 c.f. UK Annual Reports Free Style (no standard structure) Use of Images, Text PDF Format Content and structure varies across firms. Management have more discretion over what, where, and how much information on topics such as risk, strategy, performance, etc. is reported. This makes the extraction and analysis task more challenging; but it provides research opportunities.
9 Use contents page to extract text by section from digital pdf
10
11
12
13 Steps in extraction process: Detect contents page Parse contents page Detect page numbering to determine section start/end Add headers as bookmarks to pdf Extract text for each section Analyse extracted text by section and for entire document
14 In addition to performing text extraction, the tool provides a range of text analysis options: Readability metrics Word counts using pre-determined lists (e.g., forward looking, uncertainty, tone, etc.) Word counts based on user-defined wordlists Comparison with reference corpus (word, parts of speech and semantic level) Concordance and collocates Upload and analyse user-defined text file
15 Overview (wordlists, readability metrics) and interface with Wmatrix Uploading one or more annual reports and generating output Uploading and analysing with a user-defined key word list Uploading and analysing a user-defined text file Examples of further analysis in Wmatrix: Cloud for chairman s statement vs. standard reference corpus: word level Cloud for chairman s statement vs. to standard reference corpus: semantic level Cloud for chairman s statement vs. chairman s statement corpus: word level Cloud for chairman s statement vs. chairman s statement corpus: semantic level Concordance/collocation
16 Case study: Strategic content Measure strategic content of annual reports by automatic analysis of words and phrases in financial commentary and business strategy sections Prior UK studies: manual, 300 reports 60,000 sections in 10,000 reports 1156 key words and phrases Clear evidence of doubling of strategic content since 2002
17 Future developments Choice of different readability metrics Text re-use metric to detect boilerplating and incremental changes Reference corpora for: Full Annual Report and Accounts Chairman s Statement CEO Review/Business Review/Operating Review/Financial Review Highlights Corporate Governance Directors Remuneration Readability scores for all UK annual reports for the period Further methods for representing key themes (e.g., strategy, risk, CSR and governance, etc.) by section
Computer Assisted Language Learning (CALL): Room for CompLing? Scott, Stella, Stacia
Computer Assisted Language Learning (CALL): Room for CompLing? Scott, Stella, Stacia Outline I What is CALL? (scott) II Popular language learning sites (stella) Livemocha.com (stacia) III IV Specific sites
Corpus and Discourse. The Web As Corpus. Theory and Practice MARISTELLA GATTO LONDON NEW DELHI NEW YORK SYDNEY
Corpus and Discourse The Web As Corpus Theory and Practice MARISTELLA GATTO B L O O M S B U R Y LONDON NEW DELHI NEW YORK SYDNEY Contents List of Figures xiii List of Tables xvii Preface xix Acknowledgements
INTRODUCTION TO DATA MINING SAS ENTERPRISE MINER
INTRODUCTION TO DATA MINING SAS ENTERPRISE MINER Mary-Elizabeth ( M-E ) Eddlestone Principal Systems Engineer, Analytics SAS Customer Loyalty, SAS Institute, Inc. AGENDA Overview/Introduction to Data Mining
Automatic Text Analysis Using Drupal
Automatic Text Analysis Using Drupal By Herman Chai Computer Engineering California Polytechnic State University, San Luis Obispo Advised by Dr. Foaad Khosmood June 14, 2013 Abstract Natural language processing
C o p yr i g ht 2015, S A S I nstitute Inc. A l l r i g hts r eser v ed. INTRODUCTION TO SAS TEXT MINER
INTRODUCTION TO SAS TEXT MINER TODAY S AGENDA INTRODUCTION TO SAS TEXT MINER Define data mining Overview of SAS Enterprise Miner Describe text analytics and define text data mining Text Mining Process
Using Feedback Tags and Sentiment Analysis to Generate Sharable Learning Resources
Using Feedback Tags and Sentiment Analysis to Generate Sharable Learning Resources Investigating Automated Sentiment Analysis of Feedback Tags in a Programming Course Stephen Cummins, Liz Burd, Andrew
AntConc: Design and Development of a Freeware Corpus Analysis Toolkit for the Technical Writing Classroom
AntConc: Design and Development of a Freeware Corpus Analysis Toolkit for the Technical Writing Classroom Laurence Anthony Waseda University [email protected] Abstract In this paper, I will
Technical Report. The KNIME Text Processing Feature:
Technical Report The KNIME Text Processing Feature: An Introduction Dr. Killian Thiel Dr. Michael Berthold [email protected] [email protected] Copyright 2012 by KNIME.com AG
Special Topics in Computer Science
Special Topics in Computer Science NLP in a Nutshell CS492B Spring Semester 2009 Jong C. Park Computer Science Department Korea Advanced Institute of Science and Technology INTRODUCTION Jong C. Park, CS
Dedoose Distinguishing features and functions
Dedoose Distinguishing features and functions This document is intended to be read in conjunction with the Choosing a CAQDAS Package Working Paper which provides a more general commentary of common CAQDAS
Text Mining - Scope and Applications
Journal of Computer Science and Applications. ISSN 2231-1270 Volume 5, Number 2 (2013), pp. 51-55 International Research Publication House http://www.irphouse.com Text Mining - Scope and Applications Miss
XBRL guide for UK businesses
XBRL guide for UK businesses This document provides an introduction for UK businesses to the extensible Business Reporting Language (XBRL) data format and Inline XBRL (ixbrl), the standard form of presentation
TDPA: Trend Detection and Predictive Analytics
TDPA: Trend Detection and Predictive Analytics M. Sakthi ganesh 1, CH.Pradeep Reddy 2, N.Manikandan 3, DR.P.Venkata krishna 4 1. Assistant Professor, School of Information Technology & Engineering (SITE),
PROMT Technologies for Translation and Big Data
PROMT Technologies for Translation and Big Data Overview and Use Cases Julia Epiphantseva PROMT About PROMT EXPIRIENCED Founded in 1991. One of the world leading machine translation provider DIVERSIFIED
Search and Information Retrieval
Search and Information Retrieval Search on the Web 1 is a daily activity for many people throughout the world Search and communication are most popular uses of the computer Applications involving search
DATA EXPERTS MINE ANALYZE VISUALIZE. We accelerate research and transform data to help you create actionable insights
DATA EXPERTS We accelerate research and transform data to help you create actionable insights WE MINE WE ANALYZE WE VISUALIZE Domains Data Mining Mining longitudinal and linked datasets from web and other
Semantic annotation of requirements for automatic UML class diagram generation
www.ijcsi.org 259 Semantic annotation of requirements for automatic UML class diagram generation Soumaya Amdouni 1, Wahiba Ben Abdessalem Karaa 2 and Sondes Bouabid 3 1 University of tunis High Institute
Question template for interviews
Question template for interviews This interview template creates a framework for the interviews. The template should not be considered too restrictive. If an interview reveals information not covered by
Find the signal in the noise
Find the signal in the noise Electronic Health Records: The challenge The adoption of Electronic Health Records (EHRs) in the USA is rapidly increasing, due to the Health Information Technology and Clinical
Augmented Search for Web Applications. New frontier in big log data analysis and application intelligence
Augmented Search for Web Applications New frontier in big log data analysis and application intelligence Business white paper May 2015 Web applications are the most common business applications today.
STAR WARS AND THE ART OF DATA SCIENCE
STAR WARS AND THE ART OF DATA SCIENCE MELODIE RUSH, SENIOR ANALYTICAL ENGINEER CUSTOMER LOYALTY Original Presentation Created And Presented By Mary Osborne, Business Visualization Manager At 2014 SAS Global
The Knowledge Sharing Infrastructure KSI. Steven Krauwer
The Knowledge Sharing Infrastructure KSI Steven Krauwer 1 Why a KSI? Building or using a complex installation requires specialized skills and expertise. CLARIN is no exception. CLARIN is populated with
Big Data uses cases and implementation pilots at the OECD
Distr. GENERAL Working Paper 28 February 2014 ENGLISH ONLY UNITED NATIONS ECONOMIC COMMISSION FOR EUROPE (ECE) CONFERENCE OF EUROPEAN STATISTICIANS ORGANISATION FOR ECONOMIC COOPERATION AND DEVELOPMENT
Exploration and Visualization of Post-Market Data
Exploration and Visualization of Post-Market Data Jianying Hu, PhD Joint work with David Gotz, Shahram Ebadollahi, Jimeng Sun, Fei Wang, Marianthi Markatou Healthcare Analytics Research IBM T.J. Watson
Yandex: Webmaster Tools Overview and Guidelines
Yandex: Webmaster Tools Overview and Guidelines Agenda Introduction Register Features and Tools 2 Introduction What is Yandex Yandex is the leading search engine in Russia. It has nearly 60% market share
USABILITY OF A FILIPINO LANGUAGE TOOLS WEBSITE
USABILITY OF A FILIPINO LANGUAGE TOOLS WEBSITE Ria A. Sagum, MCS Department of Computer Science, College of Computer and Information Sciences Polytechnic University of the Philippines, Manila, Philippines
Terminology Extraction from Log Files
Terminology Extraction from Log Files Hassan Saneifar 1,2, Stéphane Bonniol 2, Anne Laurent 1, Pascal Poncelet 1, and Mathieu Roche 1 1 LIRMM - Université Montpellier 2 - CNRS 161 rue Ada, 34392 Montpellier
Processing: current projects and research at the IXA Group
Natural Language Processing: current projects and research at the IXA Group IXA Research Group on NLP University of the Basque Country Xabier Artola Zubillaga Motivation A language that seeks to survive
Logistics of setting up a company Dr Carl Firth ASLAN Pharmaceuticals
Logistics of setting up a company Dr Carl Firth ASLAN Pharmaceuticals 20 January 2015 Outline Registering a company Branding Capital structure People Finance IT Offices Other things to consider 2 3 1.
Regulatory News Service (RNS)
Regulatory News Service (RNS) Pricing and Policy Guidelines Effective 1 January 2016 Contents 1.0 Definitions 4 2.0 Introduction 6 3.0 Document purpose and audience 6 4.0 Licence fees and data charges
Unstructured Threat Intelligence Processing using NLP
Accenture Technology Labs Elvis Hovor @kofibaron Shimon Modi @shimonmodi Shaan Mulchandani @alabama_shaan Unstructured Threat Intelligence Processing using NLP Enhancing Cyber Security Operations by Automating
Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast
Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast Hassan Sawaf Science Applications International Corporation (SAIC) 7990
Module Catalogue for the Bachelor Program in Computational Linguistics at the University of Heidelberg
Module Catalogue for the Bachelor Program in Computational Linguistics at the University of Heidelberg March 1, 2007 The catalogue is organized into sections of (1) obligatory modules ( Basismodule ) that
Survey Results: Requirements and Use Cases for Linguistic Linked Data
Survey Results: Requirements and Use Cases for Linguistic Linked Data 1 Introduction This survey was conducted by the FP7 Project LIDER (http://www.lider-project.eu/) as input into the W3C Community Group
The role of modern board. Chartered Institute of Management Accountants
The role of the CFO on the modern board Chartered Institute of Management Accountants The role of the CFO on the modern board 1 INTRODUCTION This briefing is based on the discussion at a recent CIMA event,
PRODUCT DESCRIPTION OF SERVICES PROVIDED BY IPEER
PRODUCT DESCRIPTION OF SERVICES PROVIDED BY IPEER 1. General This agreement / product description ("Product description") hereby states the terms and conditions for the services provided by Ipeer such
WHITEPAPER. Text Analytics Beginner s Guide
WHITEPAPER Text Analytics Beginner s Guide What is Text Analytics? Text Analytics describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content
Binge drinking increases risk of dementia
1 Key words Fill the gaps in the sentences using these key words from the text. dementia condition binge drinking epidemic diagnosis brain damage liver priority cope with reduce 7. If you give something,
CSR REPORT 2016 Corporate Social Responsibility Report
CSR REPORT 2016 Corporate Social Responsibility Report 01 02 03 07 13 14 15 17 Business 19 20 21 22 Support and Contribution 23 Management System 27 31 with Employee 02 Business 03 1 2 04 18 1 3 4 2 6
INTERNET MARKETING. SEO Course Syllabus Modules includes: COURSE BROCHURE
AWA offers a wide-ranging yet comprehensive overview into the world of Internet Marketing and Social Networking, examining the most effective methods for utilizing the power of the internet to conduct
Life's cheap, but who's buying?
Life Conference and Exhibition 2011 Russell Higginbotham and Alan Martin - Swiss Re Life's cheap, but who's buying? 22 nd November 2011 2010 The Actuarial Profession www.actuaries.org.uk Market pricing
Computer Concepts And Applications CIS-107-TE. TECEP Test Description
Computer Concepts And Applications CIS-107-TE This TECEP tests content covered in a one-semester course in computer concepts and applications. It focuses on an overview of computers, including historical
Backbase Accessibility
Whitepaper Learn about: Section 508 Accessibility requirements Backbase compliance Introduction This paper discusses the growing importance of Rich Internet Applications (RIA s) and their support for Accessibility.
CS 6740 / INFO 6300. Ad-hoc IR. Graduate-level introduction to technologies for the computational treatment of information in humanlanguage
CS 6740 / INFO 6300 Advanced d Language Technologies Graduate-level introduction to technologies for the computational treatment of information in humanlanguage form, covering natural-language processing
PREDICTING MARKET VOLATILITY FEDERAL RESERVE BOARD MEETING MINUTES FROM
PREDICTING MARKET VOLATILITY FROM FEDERAL RESERVE BOARD MEETING MINUTES Reza Bosagh Zadeh and Andreas Zollmann Lab Advisers: Noah Smith and Bryan Routledge GOALS Make Money! Not really. Find interesting
Adam Walsh Child Protection and Safety Act. Or SORNA Sex Offender Registration and Notification Act By Chris Phillis Maricopa Public Defender
Adam Walsh Child Protection and Safety Act Or SORNA Sex Offender Registration and Notification Act By Chris Phillis Maricopa Public Defender Signed into law by George W. Bush on July 27, 2006 Creates a
Social Media Monitoring, Planning and Delivery
Social Media Monitoring, Planning and Delivery G-CLOUD 4 SERVICES September 2013 V2.0 Contents 1. Service Overview... 3 2. G-Cloud Compliance... 12 Page 2 of 12 1. Service Overview Introduction CDS provide
How to Drive More Traffic to Your Event Website
Event Director s Best Practices Webinar Series Presents How to Drive More Traffic to Your Event Website Matt Clymer Online Marketing Specialist December 16 th 2010 Today s Speakers Moderator Guest Speaker
Web Developer's Status Report. Laurie Laforest, Ph.D. Academic Technology 22 May 2009
Web Developer's Status Report Laurie Laforest, Ph.D. Academic Technology 22 May 2009 Overview Move to Concordia template Staff training on Contribute software IITS Projects Google audit of sites in Concordia
Workshop. Neil Barrett PhD, Jens Weber PhD, Vincent Thai MD. Engineering & Health Informa2on Science
Engineering & Health Informa2on Science Engineering NLP Solu/ons for Structured Informa/on from Clinical Text: Extrac'ng Sen'nel Events from Pallia've Care Consult Le8ers Canada-China Clean Energy Initiative
Service Road Map for ANDS Core Infrastructure and Applications Programs
Service Road Map for ANDS Core and Applications Programs Version 1.0 public exposure draft 31-March 2010 Document Target Audience This is a high level reference guide designed to communicate to ANDS external
THE UNIVERSITY OF MANCHESTER PARTICULARS OF APPOINTMENT FACULTY OF MEDICAL AND HUMAN SCIENCES SCHOOL OF PSYCHOLOGICAL SCIENCES
Ref : MHS-05651 Internal ref: MR THE UNIVERSITY OF MANCHESTER PARTICULARS OF APPOINTMENT FACULTY OF MEDICAL AND HUMAN SCIENCES SCHOOL OF PSYCHOLOGICAL SCIENCES RESEARCH ASSISTANT (Polish Speaker) (Ref:
Extraction of Legal Definitions from a Japanese Statutory Corpus Toward Construction of a Legal Term Ontology
Extraction of Legal Definitions from a Japanese Statutory Corpus Toward Construction of a Legal Term Ontology Makoto Nakamura, Yasuhiro Ogawa, Katsuhiko Toyama Japan Legal Information Institute, Graduate
National curriculum tests. Key stage 2. English reading test framework. National curriculum tests from 2016. For test developers
National curriculum tests Key stage 2 English reading test framework National curriculum tests from 2016 For test developers Crown copyright 2015 2016 key stage 2 English reading test framework: national
31 Case Studies: Java Natural Language Tools Available on the Web
31 Case Studies: Java Natural Language Tools Available on the Web Chapter Objectives Chapter Contents This chapter provides a number of sources for open source and free atural language understanding software
GrammAds: Keyword and Ad Creative Generator for Online Advertising Campaigns
GrammAds: Keyword and Ad Creative Generator for Online Advertising Campaigns Stamatina Thomaidou 1,2, Konstantinos Leymonis 1,2, Michalis Vazirgiannis 1,2,3 Presented by: Fragkiskos Malliaros 2 1 : Athens
CORPORATE GOVERNANCE STATEMENT
CORPORATE GOVERNANCE STATEMENT Extracted from 30 June 2013 Annual Report The Directors of Gascoyne Resources Limited believe that effective corporate governance improves company performance, enhances corporate
Building a Question Classifier for a TREC-Style Question Answering System
Building a Question Classifier for a TREC-Style Question Answering System Richard May & Ari Steinberg Topic: Question Classification We define Question Classification (QC) here to be the task that, given
The authors provide the frameworks, analysis tools and route-maps to understand and action creating a marketdriven
: How to build and lead a market-driven organisation Malcolm McDonald, Martin Christopher, Simon Knox & Adrian Payne FT/Prentice Hall, 2001 ISBN: 0273642499, 206 pages Theme of the Book Marketing is too
OPEN SOURCE SOFTWARE CUSTODIAN AS A SERVICE
OPEN SOURCE SOFTWARE CUSTODIAN AS A SERVICE Martin Callinan [email protected] Wednesday, June 15, 2016 Table of Contents Introduction... 2 Source Code Control... 2 What we do... 2 Service
A Framework-based Online Question Answering System. Oliver Scheuer, Dan Shen, Dietrich Klakow
A Framework-based Online Question Answering System Oliver Scheuer, Dan Shen, Dietrich Klakow Outline General Structure for Online QA System Problems in General Structure Framework-based Online QA system
Corporate Governance Statement
Corporate Governance Statement The Board of Directors of Sandon Capital Investments Limited (Sandon or the Company) is responsible for the corporate governance of the Company. The Board guides and monitors
Current Themes in Social Marketing Research: Text-mining the past five years
This is a final draft, pre-refereed version of the article. For the final, refereed article, please go to: http://dx.doi.org/10.1080/15245001003746790 Dahl, Stephan (2010) Current Themes in Social Marketing
Voluntary Product Accessibility Template Blackboard Learn Release 9.1 April 2014 (Published April 30, 2014)
Voluntary Product Accessibility Template Blackboard Learn Release 9.1 April 2014 (Published April 30, 2014) Contents: Introduction Key Improvements VPAT Section 1194.21: Software Applications and Operating
Company Tax Returns: format for accounts forming part of an online return
Company s: format for accounts forming part of an online return Overview From 1 April 2011, for accounting periods ending on or after 1 April 2010, Company s must be filed online. A Company comprises the
NHS Procurement Dashboard: Overview
NHS Procurement Dashboard: Overview November 2013 You may re-use the text of this document (not including logos) free of charge in any format or medium, under the terms of the Open Government Licence.
ACT Enrollment Planners Conference
CONVERGE CONSULTING 7.13.16 ACT Enrollment Planners Conference PRESENTERS: Ann Oleson, CEO Jay Kelly, President 2 ABOUT CONVERGE 3 4 AGENDA 2015 Inbound Marketing in Higher Education Survey Methodology
Doing Multidisciplinary Research in Data Science
Doing Multidisciplinary Research in Data Science Assoc.Prof. Abzetdin ADAMOV CeDAWI - Center for Data Analytics and Web Insights Qafqaz University [email protected] http://ce.qu.edu.az/~aadamov 16 May
CDA for Common Document Types: Objectives, Status, and Relationship to Computer-assisted Coding
CDA for Common Document Types: Objectives, Status, and Relationship to Computer-assisted Coding CDA for Common Document Types: Objectives, Status, and Relationship to Computer-assisted Coding by Liora
Product to Customer. through MDM. Presented by Luminita Vollmer, MBA, CDMP, CBIP
Product to Customer A Fundamental Change through MDM Presented by Luminita Vollmer, MBA, CDMP, CBIP May 1, 2012 Atlanda, GA EDW 2012 Contents Introduction The Focus of the Presentation Disclaimer The story
