Theo JD Bothma, Department of Information Science, theo.bothma@up.ac.za




Reflections on the role of corpora and big data in e-lexicography in relation to end user information needs
CILC 2015, 7th International Conference on Corpus Linguistics, Valladolid, 5-7 March 2015

Overview
- Introduction
- Clarification of terms
- Access to linked corpora
- Access to frequency patterns
- Technical issues
- Conclusion

Introduction
- Focus of presentation
- Function theory
- Filtering of data
- Data on demand
- Lexicotainment

Focus of presentation
- How corpora and big data can be used to supplement dictionary data
  - Specifically for end user information needs
  - On demand
- Why users would need such data
- Research that needs to be done
  - To provide useful tools
- Focus on end user

Outside scope of presentation
- The use of corpora by lexicographers to create dictionaries
  - Predefined corpora or web as corpus
  - Many papers at the conference
- The use of corpora and big data in digital humanities research
  - Lexicographic research
- The use of big data by commercial entities

Function theory
- Communicative situations, where a need to solve a communication problem may occur
  - Text reception
  - Text production
  - Translation
- Cognitive situations, where a need for knowledge may occur
- Operative situations
- Interpretive situations

Filtering of data
- "...uncovering the needs users have in the last 20 percent of the look-ups, i.e. in one out of five consultations ... discover the needs that only show up in one out of a hundred or one out of a thousand consultations" (Tarp, 2009a)
- "...articles that are especially adapted individualization of the lexical product, adapting to the concrete needs of a concrete user" (Tarp, 2009b)

Data on demand
- One cannot develop separate dictionaries for 1 in 1000 queries
- Data on demand
  - Filtering data through search and presentation options
  - "More" button
  - Link to internal data
  - Link to external data, including corpora
  - Link to lexicographic user support tools

Lexicotainment
- Commercial publishers develop many data-on-demand options for online dictionaries
  - Often interactive
- There must be a need for such functions
  - Why develop them if not used?
- Often / usually free
  - Marketing
- Vast array of gadgets / tools / types of information

Dictionary.com
- Word of the day
- Blog
- Language tips
- Quizzes
- Crosswords
- Trending
  - International
  - Local
- Slide shows

Dictionary.com local lookups

Merriam-Webster

Merriam-Webster (2)

Clarification of terms / Examples
- Corpus
- Big data
- Structured / unstructured data
- Data analytics
- Text / data mining

Corpus
- No definition required
- Oxford English Corpus: "...over 2 billion words of real 21st century English. It is not only size that matters, though: it is the size of the corpus coupled with the careful selection and development of its contents which means that it is a resource unlike any other in the world." (http://www.oxforddictionaries.com/words/about-the-oxford-english-corpus)
- UMBC WebBase Corpus: over three billion words, 48GB (http://ebiquity.umbc.edu/resource/html/id/351)
- Library of Congress: 10 TB, probably about 3 Petabytes (3,000 TB) if all multimedia is included (http://blogs.loc.gov/digitalpreservation/2012/03/how-many-libraries-of-congress-does-it-take/)

Google Books corpus
- "The total collection contains more than 6% of all books ever published." (Lin, Y et al. 2012)

Big data
- "Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn't fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it." (Edd Dumbill. 2013. Making sense of big data. Big Data 1(1). http://online.liebertpub.com/doi/pdf/10.1089/big.2012.1503)

Big data (2)
- "Big data is high-volume, -velocity and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making." (Beyer, MA & Laney, D. 2012. The Importance of 'Big Data': A Definition. Gartner.)
- "...the real-time and high frequency nature of the data is also key. For example, nowcasting is used extensively and adds considerable power to prediction. Similarly the high frequency of data allows users to test theories in near real-time and to a level never before possible." (The Challenges and awards of big data. 2013.)

Examples
- "The NSA very casually dropped a number: Every six hours, the agency collects as much data as is stored in the entire Library of Congress." (http://www.popsci.com/technology/article/2011-05/every-six-hours-nsa-gathers-much-data-stored-entire-library-congress)
- "eBay.com uses two data warehouses at 7.5 petabytes and 40PB as well as a 40PB Hadoop cluster for search, consumer recommendations, and merchandising." (Tay, L. 2013. Inside eBay's 90PB data warehouse. http://www.itnews.com.au/news/342615,inside-ebay8217s-90pb-data-warehouse.aspx)

Square Kilometre Array
- "The SKA represents the ultimate Big Data challenge" (http://www.zurich.ibm.com/pdf/astron/cebit%202013%20background%20DOME.pdf)
- The project is expected to deliver up to an exabyte a day of raw data, compressed to some 10 petabytes of data in images for storage. "This telescope will generate the same amount of data in a day as the entire planet does in a year. We estimate that there will be more data flowing inside the telescope network than the entire internet in 2020." (http://www.computerworld.com.au/article/392735/ska_telescope_generate_more_data_than_entire_internet_2020/)
- "The data collected by the SKA in a single day would take nearly two million years to playback on an iPod." (https://www.skatelescope.org/amazingfacts/)

http://datacook.blogspot.com/

Structured / unstructured data
- Structured data: "Data that resides in a fixed field within a record or file."
- Unstructured data: "All those things that can't be so readily classified and fit into a neat box: photos and graphic images, videos, streaming instrument data, webpages, pdf files, PowerPoint presentations, emails, blog entries, wikis and word processing documents."
- Semi-structured data: "A cross between the two. Tags or other types of markers are used to identify certain elements within the data, but the data doesn't have a rigid structure."
(http://www.webopedia.com/term/s/structured_data.html)

Data analytics
- "Predictive analytics, data mining, text mining, forecasting and data optimization." (http://www.webopedia.com/term/b/big_data_analytics.html)
- "Big data uses inductive statistics and concepts from nonlinear system identification to infer laws (regressions, nonlinear relationships, and causal effects) from large sets of data with low information density to reveal relationships, dependencies and perform predictions of outcomes and behaviors" (http://en.wikipedia.org/wiki/big_data#science)

Text / data mining
- "Data Mining is an analytic process designed to explore data (usually large amounts of data - typically business or market related - also known as "big data") in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. The ultimate goal of data mining is prediction - and predictive data mining is the most common type of data mining and one that has the most direct business applications." (StatSoft, Inc. (2013). Electronic Statistics Textbook. Tulsa, OK: StatSoft. http://www.statsoft.com/textbook/.)

Big data in e-lexicography
- Yes
  - Volumes
  - Database requirements
  - Processing requirements
  - Research
  - Digital humanities
  - Word studies
- No
  - Speed of change
  - Nature of user requirements
  - Actionable

Two examples
- Access to linked corpora
- Access to frequency patterns
- Data available on demand
- Examples from existing dictionaries / tools
- Characteristics
- Why end users would want to do this
- Problems

Frequency patterns
- Only one commercial application discussed: Google Ngram viewer
- Multiple databases
- Some tagged for PoS and syntactic dependencies
  - "12 language universal part-of-speech tags and unlabeled head-modifier dependencies" (Lin, Y et al. 2012. Syntactic Annotations for the Google Books Ngram Corpus. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pp. 169-174, Jeju, Republic of Korea, 8-14 July 2012. Association for Computational Linguistics)
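A frequency pattern of the kind plotted by an n-gram viewer can be thought of as a word's share of all tokens per year. The sketch below is only an illustration on invented toy data; the `observations` list and the `relative_frequencies` helper are assumptions made here and do not describe how Google Ngram viewer is implemented.

```python
from collections import Counter

# Hypothetical toy data: (year, token) pairs standing in for a dated corpus.
observations = [
    (1935, "racialism"), (1935, "racialism"), (1935, "racism"), (1935, "the"),
    (1975, "racism"), (1975, "racism"), (1975, "racialism"),
    (1975, "the"), (1975, "the"),
]

def relative_frequencies(pairs, word):
    """Return {year: share of that year's tokens that equal `word`}."""
    totals = Counter(year for year, _ in pairs)
    hits = Counter(year for year, token in pairs if token == word)
    return {year: hits[year] / totals[year] for year in sorted(totals)}

racialism_by_year = relative_frequencies(observations, "racialism")
racism_by_year = relative_frequencies(observations, "racism")
```

Normalising by the yearly total matters because the amount of text per year differs; raw counts alone would mostly track corpus size.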

Racialism / Racism
- OED: Racialism
  - "= racism n. An earlier term than racism n., but now largely superseded by it"
  - Examples from 1902-2001
- OED: Racism
  - Examples from 1903-2003
  - Not distinguished from racialism
- Approximately similar entries for racialist / racist

racialism / racism

Walkman / iPod

catch up with X / catch X up
- On demand

Why end users would want to do this
- Cognitive / lexicotainment
  - Simply interested to learn more about word usage and word history
  - See word in context over time
  - Understand word usage better
- Text production
  - Decide between alternatives
  - Situate word in historical context when writing a text

Problems
- No genre-specific search options
  - General trade, fiction, academic, newspaper, etc.
- No usage-tagged search option
  - Formal, colloquial and slang, regional, etc.
  - Direct speech
- Date of writing not distinguished from date of context

Corpora and frequency tables
- Positive
  - Examples of actual usage
  - Genre-specific distinctions
    - DWDS
    - Limited availability for Google Ngram viewer: fiction / non-fiction, British / American
  - Drill-down to context
- Negative
  - Time distinctions
  - Limited drill-down options
  - Limited granularity

Racialism / Racism
- OED no help: both occur
- Example from fiction
  - Text written in 2012, set in USA in 1941 (vol. 2 of trilogy)
    - Racialist counted as occurrence in 2012
    - Racialist used in direct speech, racism in narrative
  - Text written in 2014, set in USA in 1961 (vol. 3)
    - Uses racism exclusively (direct speech and narrative)
- Current UK TV programme: racialist
- Does corpus reflect usage in the 1940s / 1960s / 2000s?
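The counting problem in the fiction example can be made concrete with a small sketch. The `Hit` record and its `setting` and `direct_speech` fields are hypothetical markup assumed for illustration; the four hits simply encode the two texts described above.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Hit:
    word: str
    written: int         # year the text was written
    setting: int         # year in which the story is set (hypothetical markup)
    direct_speech: bool  # True if the word occurs in direct speech

# The two texts from the fiction example, encoded as individual hits.
hits = [
    Hit("racialist", 2012, 1941, True),
    Hit("racism", 2012, 1941, False),
    Hit("racism", 2014, 1961, True),
    Hit("racism", 2014, 1961, False),
]

def count_by(records, key):
    """Count (year, word) pairs, taking the year from attribute `key`."""
    return Counter((getattr(r, key), r.word) for r in records)

by_written = count_by(hits, "written")  # what a date-of-writing corpus records
by_setting = count_by(hits, "setting")  # what a date-of-setting view would show
```

With only the date of writing available, "racialist" is attributed to 2012; with a date-of-setting field it can be attributed to 1941 instead.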

Walkman / iPod
- Evident rise of iPod vs Walkman
- Does the higher use of Walkman even in 2008 reflect actual technology proliferation?
- To what extent do books (vs other media) reflect actual usage?

Conjugated / Inflected forms
- Excellent feature
- Aggregated usage?
- PoS tagging
- Future tagging?
  - Syntactic (cf. Google Books)
  - Semantic
  - Etc.

Catch X up / Catch up with X
- Mixed bag until 1930s
- Thereafter clear preference for catch up with X
- Current British colloquial catch you up?
- Is there a distinction between formal and colloquial?
- Would a different corpus reveal a different pattern?
- Is there a difference between the two items in direct speech compared to narrative?

Technical issues
- Corpus and data set selection
- Corpus clean-up
- Corpus markup
- Search functions and filtering
- Data presentation and usability issues

Corpus and data set selection
- Decide on intended use
  - Lexicographer, researcher, professional, end user
- Decide on characteristics
  - Contemporary / diachronic
  - Genre-specific / general
  - Formal / informal (e.g. social media)
- Size
  - Selective or as large as possible
  - Hardware / software
- Copyright

Corpus clean-up
- Digitisation
  - Image clean-up
  - OCR
  - Quality control of both images and OCR
- Digitized materials
  - Remove noise
  - HTML
- Long-term preservation and curation of originals
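As one small illustration of removing HTML noise, the sketch below strips tags and the contents of script and style elements using Python's standard `html.parser` module. The `TextExtractor` class, the `strip_html` helper and the sample markup are assumptions made for this example; a real clean-up pipeline would also have to deal with navigation text, encoding problems and OCR errors.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text content while skipping <script> and <style> bodies."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def strip_html(markup):
    """Return the visible text of `markup` with whitespace normalised.

    Limitation: text chunks are joined with spaces, so a tag placed
    inside a word would split that word in the output.
    """
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(" ".join(parser.parts).split())

sample = "<html><style>p{color:red}</style><p>Catch  you <b>up</b> later.</p></html>"
clean = strip_html(sample)
```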

Corpus markup
- Linguistic processing
  - Tokenization
  - PoS tagging
  - Lemmatization
  - Additional grammatical markup
- Metadata markup
  - Standard
    - Bibliographic data
    - Grammatical tagging
  - Additional markup
    - Genre
    - Date of setting
    - Direct speech vs narrative
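To show what token-level markup makes possible, the sketch below stores each token with a lemma and a PoS tag and then aggregates inflected forms under their lemma. The `Token` type and the hand-assigned annotations are hypothetical; in practice the tags would come from a tagger and a lemmatizer, and the tag labels used here are illustrative only.

```python
from collections import Counter
from typing import NamedTuple

class Token(NamedTuple):
    form: str   # surface form as it appears in the text
    lemma: str  # dictionary headword the form belongs to
    pos: str    # part-of-speech tag (assigned by hand in this example)

# Hand-annotated toy tokens from two invented sentences.
tokens = [
    Token("She", "she", "PRON"),
    Token("caught", "catch", "VERB"),
    Token("up", "up", "PRT"),
    Token("with", "with", "ADP"),
    Token("him", "he", "PRON"),
    Token("He", "he", "PRON"),
    Token("catches", "catch", "VERB"),
    Token("us", "we", "PRON"),
    Token("up", "up", "PRT"),
]

# Aggregated usage: count verb occurrences per lemma rather than per form.
verb_lemma_counts = Counter(t.lemma for t in tokens if t.pos == "VERB")
forms_of_catch = sorted({t.form for t in tokens if t.lemma == "catch"})
```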

Filtering and search functions
- Selection of corpus or multiple corpora
- Based on markup
  - Metadata
  - Grammatical
  - Fine grained
- Subset of corpora
  - Date
  - Genre
  - Etc.
- Complex combination of criteria
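Selecting a subset of a corpus by a combination of metadata criteria could look roughly like the sketch below. The `Document` fields and the `select` function are assumptions made for illustration and are not taken from any existing corpus tool.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Document:
    doc_id: str
    genre: str    # e.g. "newspaper", "fiction"
    variety: str  # e.g. "British", "American"
    year: int     # date of writing

def select(docs, genre=None, variety=None, year_from=None, year_to=None):
    """Return the documents matching every criterion that is not None."""
    chosen = []
    for doc in docs:
        if genre is not None and doc.genre != genre:
            continue
        if variety is not None and doc.variety != variety:
            continue
        if year_from is not None and doc.year < year_from:
            continue
        if year_to is not None and doc.year > year_to:
            continue
        chosen.append(doc)
    return chosen

# Invented toy metadata.
corpus = [
    Document("d1", "newspaper", "British", 1945),
    Document("d2", "fiction", "British", 1952),
    Document("d3", "newspaper", "American", 1948),
    Document("d4", "fiction", "British", 1971),
]

subset = select(corpus, genre="fiction", variety="British",
                year_from=1940, year_to=1960)
```

Leaving a criterion as `None` means "do not filter on this", so the same function covers both a single-criterion and a complex combined search.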

Data presentation and usability issues
- Incorporated into dictionary interface
  - How
- Display only required data
- Drill-down to actual texts on demand
  - From corpus examples
  - From frequency tables
- User studies

Examples
- Show all examples of: catch PRON up and catch up with PRON
  - In a newspaper corpus
  - Dating between 1940 and 1950
- Show all examples of: racialist and racist
  - British fiction compared to British newspapers
  - Between 1940 and 1960
  - Distinguished between direct speech and narrative
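The first query above can be approximated on plain text with regular expressions, as a rough sketch. The pronoun list, the verb forms, the `query` helper and the four sentences are all invented for illustration; a PoS-tagged corpus would let PRON be matched by tag instead of by a word list.

```python
import re

# Assumed word lists standing in for PoS-based matching.
PRON = r"(?:me|you|him|her|it|us|them)"
CATCH = r"(?:catch|catches|catching|caught)"

CATCH_X_UP = re.compile(rf"\b{CATCH}\s+{PRON}\s+up\b", re.IGNORECASE)
CATCH_UP_WITH_X = re.compile(rf"\b{CATCH}\s+up\s+with\s+{PRON}\b", re.IGNORECASE)

# Invented toy corpus with genre and date metadata.
texts = [
    {"genre": "newspaper", "year": 1944,
     "text": "The others will catch us up at the station."},
    {"genre": "newspaper", "year": 1948,
     "text": "Events soon caught up with him."},
    {"genre": "fiction", "year": 1946,
     "text": "Go on, I'll catch you up."},
    {"genre": "newspaper", "year": 1962,
     "text": "The police caught up with them."},
]

def query(pattern, docs, genre, year_from, year_to):
    """Return all matches of `pattern` in docs of `genre` within the year range."""
    return [
        match.group(0)
        for doc in docs
        if doc["genre"] == genre and year_from <= doc["year"] <= year_to
        for match in pattern.finditer(doc["text"])
    ]

x_up = query(CATCH_X_UP, texts, "newspaper", 1940, 1950)
up_with_x = query(CATCH_UP_WITH_X, texts, "newspaper", 1940, 1950)
```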

Conclusion
- Many additional tools possible
- Multidisciplinary research
  - Impact on markup
  - Database technologies
  - Hardware requirements
  - User and usability studies
- Develop prototypes
- Innovation
- Think critically about the way forward

Lexicographers can considerably enhance the user experience by making non-traditional data available to their end users through exploiting the technologies and data accessible through corpora, big data sets and the internet.

Thank you! Questions / comments? theo.bothma@up.ac.za