Corpus and Discourse. The Web As Corpus. Theory and Practice MARISTELLA GATTO LONDON NEW DELHI NEW YORK SYDNEY




Corpus and Discourse. The Web As Corpus. Theory and Practice MARISTELLA GATTO BLOOMSBURY LONDON NEW DELHI NEW YORK SYDNEY

Contents

List of Figures
List of Tables
Preface
Acknowledgements

Introduction

1 Corpus Linguistics: Basic Principles
  Introduction
  1. Theory, approach, methods: Corpus linguistics as a research field
  2. Key issues in corpus linguistics
    2.1 Authenticity
    2.2 Representativeness
    2.3 Balance and sampling
    2.4 Size
    2.5 Types of corpora
  3. Corpus, concordance, collocation: Tools and analysis
    3.1 Corpus creation
    3.2 Corpus analysis
      3.2.1 Wordlists and keywords
      3.2.2 Concordances
    3.3 Collocation and basic statistics
    3.4 Colligation and semantic associations
  Conclusion
  Study questions and activities
  Suggestions for further reading

2 The Body and the Web: An Introduction to the Web as Corpus
  Introduction
  1. Corpus linguistics and the web
  2. The web as corpus: A 'body' of texts?
  3. The corpus and the web: Key issues
    3.1 Authenticity
    3.2 Representativeness
    3.3 Size
    3.4 Composition
      3.4.1 Medium
      3.4.2 Language
      3.4.3 Topics
      3.4.4 Registers, (web) genres, and text types
    3.5 Copyright
  4. From 'body' to 'web': New issues
    4.1 Dynamism
    4.2 Reproducibility
    4.3 Relevance and reliability
  Conclusion
  Study questions and activities
  Suggestions for further reading

3 Challenging Anarchy: Web Search from a Corpus Perspective
  Introduction
  1. The corpus and the search
  2. Search engine basics: Crawling, indexing, searching, ranking
  3. Google and the others: An overview of commercial search engines
  4. Challenging anarchy: Mastering advanced web search
    4.1 An overview of web search options
    4.2 Limits and potentials of 'webidence'
    4.3 Phrase search and collocation
    4.4 Phraseology and patterns
    4.5 Provenance, site, domain and more: Searching subsections of the web
    4.6 Testing translation candidates
  5. Query complexity: Web search from a corpus perspective
  Conclusion
  Study questions and activities
  Suggestions for further reading

4 Beyond Ordinary Search Engines: Concordancing the Web
  Introduction
  1. Beyond ordinary search engines: Concordancing the web
  2. WebCorp Live
  3. Concordancing the web in the foreign language classroom
    3.1 Exploring collocation: The case of scenery
    3.2 Investigating neologisms and phrasal creativity
  4. Beyond web concordancing tools: The web as/for corpus
  5. Towards a linguist's search engine: The case of WebCorpLSE
    5.1 The Web as/for Corpus: From WebCorp Live to WebCorpLSE
    5.2 Using WebCorpLSE to explore contemporary English
      5.2.1 Synchronic English Web Corpus and Diachronic English Web Corpus
      5.2.2 Birmingham Blog Corpus
  Conclusion
  Study questions and activities
  Suggestions for further reading

5 Building and Using Comparable Web Corpora: Tools and Methods
  Introduction
  1. Building DIY web corpora
  2. From words to corpus: The 'bootstrap' process
    2.1 Compiling a domain-specific corpus with BootCaT
    2.2 Compiling specialized corpora with WebBootCaT
  3. Building and using comparable web corpora for translation practice
  Conclusion
  Study questions and activities
  Suggestions for further reading

6 Sketches of Language and Culture from Large Web Corpora
  Introduction
  1. From web as corpus to corpus as web: Introducing large general purpose web corpora
  2. Mega-corpus, mini-web: The case of ukWaC
    2.1 Selecting 'seed URLs' and crawling
    2.2 Post-crawl cleaning and annotation
  3. Exploring large web corpora: Tools and resources
    3.1 The Sketch Engine: From concordance lines to word sketches
    3.2 The Sketch Difference function
  4. Case study: Sketches of culture from the BNC to the web
    4.1 A very difficult word...!
    4.2 Sketches of culture from the British National Corpus
    4.3 Sketches of culture from ukWaC
      4.3.1 Culture as object
      4.3.2 Culture as subject
      4.3.3 Modifiers of culture
      4.3.4 Culture as modifier
      4.3.5 The pattern culture and/or NOUN
    4.4 A culture of: The changing face of culture in contemporary society
  Conclusion
  Study questions and activities
  Suggestions for further reading

7 From Download to Upload: The Web as Corpus in the Web 2.0 Era
  Introduction
  1. From download to upload: Web users as prosumers
  2. Web 2.0 as corpus. The case of Wikipedia as a multilingual corpus
  3. The corpus in the cloud? The challenges that lie ahead
  Suggestions for further reading

Conclusion
References
Index