Theo JD Bothma
Department of Information Science
theo.bothma@up.ac.za
Reflections on the role of corpora and big data in e-lexicography in relation to end user information needs
CILC 2015, 7th International Conference on Corpus Linguistics
Valladolid, 5-7 March 2015
Overview
  Introduction
  Clarification of terms
  Access to linked corpora
  Access to frequency patterns
  Technical issues
  Conclusion
Introduction
  Focus of presentation
  Function theory
  Filtering of data
  Data on demand
  Lexicotainment
Focus of presentation
  How corpora and big data can be used to supplement dictionary data
    Specifically for end user information needs
    On demand
  Why users would need such data
  Research that needs to be done
    To provide useful tools
  Focus on end user
Outside scope of presentation
  The use of corpora by lexicographers to create dictionaries
    Predefined corpora or web as corpus
    Many papers at the conference
  The use of corpora and big data in digital humanities research
    Lexicographic research
  The use of big data by commercial entities
Function theory
  Communicative situations where a need to solve a communication problem may occur
    Text reception
    Text production
    Translation
  Cognitive situations where a need for knowledge may occur
  Operative situations
  Interpretive situations
Filtering of data
  ...uncovering the needs users have in the last 20 percent of the look-ups, i.e. in one out of five consultations...discover the needs that only show up in one out of a hundred or one out of a thousand consultations (Tarp, 2009a)
  ...articles that are especially adapted individualization of the lexical product, adapting to the concrete needs of a concrete user (Tarp, 2009b)
Data on demand
  One cannot develop separate dictionaries for 1 in a 1000 queries
  Data on demand
    Filtering data through search and presentation options
    More button
    Link to internal data
    Link to external data, including corpora
    Link to lexicographic user support tools
Lexicotainment
  Commercial publishers develop many data on demand options for online dictionaries
    Often interactive
  There must be a need for such functions
    Why develop them if not used?
  Often / usually free
    Marketing
  Vast array of gadgets / tools / types of information
Dictionary.com
  Word of the day
  Blog
  Language tips
  Quizzes
  Crosswords
  Trending
    International
    Local
  Slide shows
Dictionary.com local lookups
Merriam-Webster
Merriam-Webster (2)
Clarification of terms / Examples
  Corpus
  Big data
  Structured / unstructured data
  Data analytics
  Text / data mining
Corpus
  No definition required
  Oxford English Corpus
    ...over 2 billion words of real 21st century English. It is not only size that matters, though: it is the size of the corpus coupled with the careful selection and development of its contents which means that it is a resource unlike any other in the world. (http://www.oxforddictionaries.com/words/about-the-oxford-english-corpus)
  UMBC WebBase Corpus
    over three billion words, 48GB (http://ebiquity.umbc.edu/resource/html/id/351)
  Library of Congress
    10 TB, probably about 3 Petabytes (3,000 TB) if all multimedia is included (http://blogs.loc.gov/digitalpreservation/2012/03/how-many-libraries-of-congress-does-it-take/)
Google Books corpus
  The total collection contains more than 6% of all books ever published. (Lin, Y et al. 2012)
Big data
  Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn't fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it. (Edd Dumbill. 2013. Making sense of big data. Big data 1(1). http://online.liebertpub.com/doi/pdf/10.1089/big.2012.1503)
Big data (2)
  Big data is high-volume, -velocity and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making. (Beyer, MA & Laney, D. 2012. "The Importance of 'Big Data': A Definition". Gartner.)
  ...the real-time and high frequency nature of the data is also key. For example, nowcasting is used extensively and adds considerable power to prediction. Similarly the high frequency of data allows users to test theories in near real-time and to a level never before possible. (The Challenges and awards of big data. 2013.)
Examples
  The NSA very casually dropped a number: Every six hours, the agency collects as much data as is stored in the entire Library of Congress. (http://www.popsci.com/technology/article/2011-05/every-six-hours-nsa-gathers-much-data-stored-entire-library-congress)
  ebay.com uses two data warehouses at 7.5 petabytes and 40PB as well as a 40PB Hadoop cluster for search, consumer recommendations, and merchandising. (Tay, L. 2013. Inside eBay's 90PB data warehouse. http://www.itnews.com.au/news/342615,inside-ebay8217s-90pb-data-warehouse.aspx)
Square Kilometre Array
  The SKA represents the ultimate Big Data challenge (http://www.zurich.ibm.com/pdf/astron/cebit%202013%20background%20DOME.pdf)
  The project is expected to deliver up to an exabyte a day of raw data, compressed to some 10 petabytes of data in images for storage. "This telescope will generate the same amount of data in a day as the entire planet does in a year. We estimate that there will be more data flowing inside the telescope network than the entire internet in 2020." (http://www.computerworld.com.au/article/392735/ska_telescope_generate_more_data_than_entire_internet_2020/)
  The data collected by the SKA in a single day would take nearly two million years to playback on an iPod. (https://www.skatelescope.org/amazingfacts/)
http://datacook.blogspot.com/
Structured / unstructured data
  Structured data
    Data that resides in a fixed field within a record or file.
  Unstructured data
    All those things that can't be so readily classified and fit into a neat box: photos and graphic images, videos, streaming instrument data, webpages, pdf files, PowerPoint presentations, emails, blog entries, wikis and word processing documents.
  Semi-structured data
    A cross between the two. Tags or other types of markers are used to identify certain elements within the data, but the data doesn't have a rigid structure.
  (http://www.webopedia.com/term/s/structured_data.html)
Data analytics
  Predictive analytics, data mining, text mining, forecasting and data optimization. (http://www.webopedia.com/term/b/big_data_analytics.html)
  Big data uses inductive statistics and concepts from nonlinear system identification to infer laws (regressions, nonlinear relationships, and causal effects) from large sets of data with low information density to reveal relationships, dependencies and perform predictions of outcomes and behaviors (http://en.wikipedia.org/wiki/big_data#science)
Text / data mining
  Data Mining is an analytic process designed to explore data (usually large amounts of data - typically business or market related - also known as "big data") in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. The ultimate goal of data mining is prediction - and predictive data mining is the most common type of data mining and one that has the most direct business applications. (StatSoft, Inc. (2013). Electronic Statistics Textbook. Tulsa, OK: StatSoft. http://www.statsoft.com/textbook/.)
Big data in e-lexicography
  Yes
    Volumes
    Database requirements
    Processing requirements
    Research
    Digital humanities
    Word studies
  No
    Speed of change
    Nature of user requirements
    Actionable
Two examples
  Access to linked corpora
  Access to frequency patterns
  Data available on demand
  Examples from existing dictionaries / tools
  Characteristics
  Why end users would want to do this
  Problems
Frequency patterns
  Only one commercial application discussed: Google Ngram viewer
  Multiple databases
  Some tagged for PoS and syntactic dependencies
    12 language universal part-of-speech tags and unlabeled head-modifier dependencies (Lin, Y et al. 2012. Syntactic Annotations for the Google Books Ngram Corpus. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pp. 169-174, Jeju, Republic of Korea, 8-14 July 2012. Association for Computational Linguistics)
Racialism / Racism
  OED:
  Racialism
    = racism n.
    An earlier term than racism n., but now largely superseded by it
    Examples from 1902-2001
  Racism
    Examples from 1903-2003
    Not distinguished from racialism
  Approximately similar entries for racialist / racist
racialism / racism
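The comparison that an n-gram viewer plots for a word pair such as racialism / racism can be sketched roughly as follows. This is a minimal illustration, not how Google Ngram viewer is implemented: the observations are invented toy data, and `share_by_decade` is a hypothetical helper that reports, per decade, what share of the two variants the first one accounts for.

```python
from collections import Counter

# Invented toy data: (year, token) pairs standing in for a dated corpus.
OBSERVATIONS = [
    (1935, "racialism"), (1938, "racialism"), (1938, "racism"),
    (1962, "racism"), (1965, "racism"), (1967, "racialism"),
]

def share_by_decade(observations, variant_a, variant_b):
    """Return {decade: share of variant_a among occurrences of both variants}."""
    counts = Counter()
    for year, token in observations:
        if token in (variant_a, variant_b):
            decade = year // 10 * 10
            counts[(decade, token)] += 1
    shares = {}
    for decade in sorted({d for d, _ in counts}):
        a = counts[(decade, variant_a)]
        b = counts[(decade, variant_b)]
        shares[decade] = a / (a + b)
    return shares

shares = share_by_decade(OBSERVATIONS, "racialism", "racism")
for decade, share in shares.items():
    print(f"{decade}s: racialism = {share:.0%} of both variants")
```

A real frequency table would normalise against the total number of tokens per year rather than against the two variants only; the pairwise share is used here because it needs no corpus-size figures.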
Walkman / iPod
catch up with X / catch X up
  On demand
Why end users would want to do this
  Cognitive / lexicotainment
    Simply interested to learn more about word usage and word history
    See word in context over time
    Understand word usage better
  Text production
    Decide between alternatives
    Situate word in historical context when writing a text
Problems
  No genre-specific search options
    General trade, fiction, academic, newspaper, etc.
  No usage-tagged search option
    Formal, colloquial and slang, regional, etc.
    Direct speech
  Date of writing not distinguished from date of context
Corpora and frequency tables
  Positive
    Examples of actual usage
    Genre-specific distinctions
      DWDS
      Limited availability for Google Ngram viewer
        Fiction / non-fiction
        British / American
    Drill-down to context
  Negative
    Time distinctions
    Limited drill-down options
    Limited granularity
Racialism / Racism
  OED no help: both occur
  Example from fiction:
    Text written in 2012, set in USA in 1941 (vol. 2 of trilogy)
      Racialist counted as occurrence in 2012
      Racialist used in direct speech, racism in narrative
    Text written in 2014, set in USA in 1961 (vol. 3)
      Uses racism exclusively (direct speech and narrative)
  Current UK TV programme: racialist
  Does corpus reflect usage in the 1940s / 1960s / 2000s?
Walkman / iPod
  Evident rise of iPod vs Walkman
  Does the higher use of Walkman even in 2008 reflect actual technology proliferation?
  To what extent do books (vs other media) reflect actual usage?
Conjugated / Inflected forms
  Excellent feature
  Aggregated usage?
  PoS tagging
  Future tagging?
    Syntactic (cf. Google Books)
    Semantic
    Etc.
Catch X up / Catch up with X
  Mixed bag until 1930s
  Thereafter clear preference for catch up with X
  Current British colloquial catch you up?
  Is there a distinction between formal and colloquial?
  Would a different corpus reveal a different pattern?
  Is there a difference between the two items in direct speech compared to narrative?
Technical issues
  Corpus and data set selection
  Corpus clean-up
  Corpus markup
  Search functions and filtering
  Data presentation and usability issues
Corpus and data set selection
  Decide on intended use
    Lexicographer, researcher, professional, end user
  Decide on characteristics
    Contemporary / diachronic
    Genre-specific / general
    Formal / informal (e.g. social media)
  Size
    Selective or as large as possible
    Hardware / software
  Copyright
Corpus clean-up
  Digitisation
    Image clean-up
    OCR
    Quality control of both images and OCR
  Digitized materials
    Remove noise
      HTML
  Long-term preservation and curation of originals
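Removing HTML noise from already digitized material might look something like the sketch below. It uses only Python's standard `html.parser` module; the class name `TextExtractor`, the helper `strip_html`, the choice of block-level tags, and the sample markup are all assumptions made for illustration, and real web-derived corpora usually need considerably more (boilerplate detection, encoding repair, deduplication).

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""
    SKIP = {"script", "style"}
    BLOCK = {"p", "div", "br", "li", "h1", "h2", "h3"}  # assumed block-level tags

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            # Keep words from adjacent blocks from being glued together.
            self.parts.append(" ")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)

def strip_html(markup):
    """Return the visible text of markup with whitespace normalised."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join("".join(parser.parts).split())

sample = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><p>Catch  me <b>up</b>!</p><script>var x=1;</script></body></html>"
)
clean = strip_html(sample)
print(clean)
```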
Corpus markup
  Linguistic processing
    Tokenization
    PoS tagging
    Lemmatization
    Additional grammatical markup
  Metadata markup
    Standard
      Bibliographic data
      Grammatical tagging
    Additional markup
      Genre
      Date of setting
      Direct speech vs narrative
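The additional markup for direct speech vs narrative could, as a first approximation, be derived from quotation marks. The sketch below assumes that direct speech is always enclosed in straight or curly double quotes, which will not hold for every text or typographic tradition; `split_speech` and the sample sentence are hypothetical.

```python
import re

# Assumption: direct speech sits inside straight ("...") or curly (“...”) double quotes.
QUOTED = re.compile(r'"([^"]*)"|“([^”]*)”')

def split_speech(text):
    """Return (label, segment) pairs in reading order."""
    segments = []
    pos = 0
    for match in QUOTED.finditer(text):
        narrative = text[pos:match.start()].strip()
        if narrative:
            segments.append(("narrative", narrative))
        inner = match.group(1) if match.group(1) is not None else match.group(2)
        if inner.strip():
            segments.append(("direct_speech", inner.strip()))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        segments.append(("narrative", tail))
    return segments

sample = 'He shrugged. "I am no racialist," he said.'
segments = split_speech(sample)
for label, segment in segments:
    print(f"{label:13} {segment}")
```

Once segments carry such a label, occurrences of a word can be counted separately for direct speech and narrative.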
Filtering and search functions
  Selection of corpus or multiple corpora
  Based on markup
    Metadata
    Grammatical
  Fine grained
    Subset of corpora
    Date
    Genre
    Etc.
  Complex combination of criteria
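A complex combination of criteria over metadata markup can be sketched as a simple filter. Everything here is illustrative: the `Passage` record, its field names, and the sample passages are invented, and a production system would run such queries against a database or corpus engine rather than a Python list. The sketch keeps date of writing and date of setting as separate fields, since that distinction is one of the problems noted earlier.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Passage:
    text: str
    year_written: int   # date of writing
    year_set: int       # date of setting (may differ, e.g. in historical fiction)
    genre: str          # e.g. "fiction", "newspaper"
    mode: str           # "direct_speech" or "narrative"

# Invented sample passages, for illustration only.
CORPUS = [
    Passage("You sound like a racialist.", 2012, 1941, "fiction", "direct_speech"),
    Passage("Racism shaped the whole town.", 2012, 1941, "fiction", "narrative"),
    Passage("The minister condemned racism.", 1961, 1961, "newspaper", "narrative"),
]

def in_range(value, bounds):
    """True if bounds is None or bounds[0] <= value <= bounds[1]."""
    return bounds is None or bounds[0] <= value <= bounds[1]

def select(passages, genre=None, written=None, set_in=None, mode=None):
    """Keep the passages that match every criterion that is not None."""
    return [
        p for p in passages
        if (genre is None or p.genre == genre)
        and in_range(p.year_written, written)
        and in_range(p.year_set, set_in)
        and (mode is None or p.mode == mode)
    ]

result = select(CORPUS, genre="fiction", set_in=(1940, 1949), mode="direct_speech")
by_writing_date = select(CORPUS, written=(1940, 1949))
print([p.text for p in result])
print(len(by_writing_date))
```

Filtering the same toy corpus by date of writing for the 1940s returns nothing, while filtering by date of setting finds the 2012 novel set in 1941.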
Data presentation and usability issues
  Incorporated into dictionary interface
    How
    Display only required data
  Drill-down to actual texts on demand
    From corpus examples
    From frequency tables
  User studies
Examples
  Show all examples of: catch PRON up and catch up with PRON
    In a newspaper corpus
    Dating between 1940 and 1950
  Show all examples of: racialist and racist
    British fiction compared to British newspapers
    Between 1940 and 1960
    Distinguished between direct speech and narrative
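The first query above could be approximated as follows when no PoS tagging is available. This is a rough sketch under stated assumptions: PRON is replaced by a fixed list of object pronouns, the verb forms of catch are listed by hand, and the dated sentences are invented. With a PoS-tagged and lemmatized corpus the same query would match on tags and lemmas instead of surface strings.

```python
import re

# Assumptions: a closed list of object pronouns and of inflected forms of "catch".
PRON = r"(?:me|you|him|her|it|us|them)"
CATCH = r"(?:catch|catches|catching|caught)"

PATTERNS = {
    "catch PRON up": re.compile(rf"\b{CATCH}\s+{PRON}\s+up\b", re.IGNORECASE),
    "catch up with PRON": re.compile(rf"\b{CATCH}\s+up\s+with\s+{PRON}\b", re.IGNORECASE),
}

# Invented (year, genre, sentence) records, for illustration only.
SENTENCES = [
    (1946, "newspaper", "Go on ahead, I will catch you up."),
    (1948, "newspaper", "The police caught up with him at the border."),
    (1972, "fiction", "She ran to catch up with them."),
]

def query(sentences, genre, start, end):
    """Return {pattern name: [(year, sentence), ...]} for one genre and year range."""
    hits = {name: [] for name in PATTERNS}
    for year, sentence_genre, text in sentences:
        if sentence_genre != genre or not (start <= year <= end):
            continue
        for name, pattern in PATTERNS.items():
            if pattern.search(text):
                hits[name].append((year, text))
    return hits

hits = query(SENTENCES, "newspaper", 1940, 1950)
for name, found in hits.items():
    print(name, len(found))
```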
Conclusion
  Many additional tools possible
  Multidisciplinary research
    Impact on markup
    Database technologies
    Hardware requirements
    User and usability studies
  Develop prototypes
  Innovation
  Think critically about the way forward
Lexicographers can considerably enhance the user experience by making non-traditional data available to their end users through exploiting the technologies and data accessible through corpora, big data sets and the internet.
Thank you!
Questions / comments?
theo.bothma@up.ac.za