COMPUTATIONAL DATA ANALYSIS FOR SYNTAX



Similar documents
How the Computer Translates. Svetlana Sokolova President and CEO of PROMT, PhD.

Presented to The Federal Big Data Working Group Meetup On 07 June 2014 By Chuck Rehberg, CTO Semantic Insights a Division of Trigent Software

Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words

Index. 344 Grammar and Language Workbook, Grade 8

Authorship and Writing Style: An Analysis of Syntactic Variable Frequencies in Select Texts of Alejandro Casona

Master of Arts Program in Linguistics for Communication Department of Linguistics Faculty of Liberal Arts Thammasat University

L130: Chapter 5d. Dr. Shannon Bischoff. Dr. Shannon Bischoff () L130: Chapter 5d 1 / 25

10th Grade Language. Goal ISAT% Objective Description (with content limits) Vocabulary Words

What Is Linguistics? December 1992 Center for Applied Linguistics

Differences in linguistic and discourse features of narrative writing performance. Dr. Bilal Genç 1 Dr. Kağan Büyükkarcı 2 Ali Göksu 3

Statistical Machine Translation

THE BACHELOR S DEGREE IN SPANISH

Language Arts Literacy Areas of Focus: Grade 6

Teaching English as a Foreign Language (TEFL) Certificate Programs

Master of Arts in Linguistics Syllabus

LESSON THIRTEEN STRUCTURAL AMBIGUITY. Structural ambiguity is also referred to as syntactic ambiguity or grammatical ambiguity.

Texas Success Initiative (TSI) Assessment. Interpreting Your Score

Module Catalogue for the Bachelor Program in Computational Linguistics at the University of Heidelberg

Electronic offprint from. baltic linguistics. Vol. 3, 2012

Study Plan for Master of Arts in Applied Linguistics

Noam Chomsky: Aspects of the Theory of Syntax notes

stress, intonation and pauses and pronounce English sounds correctly. (b) To speak accurately to the listener(s) about one s thoughts and feelings,

Introduction to formal semantics -

Grammar Presentation: The Sentence

CINTIL-PropBank. CINTIL-PropBank Sub-corpus id Sentences Tokens Domain Sentences for regression atsts 779 5,654 Test

Subordinating Ideas Using Phrases It All Started with Sputnik

University of Massachusetts Boston Applied Linguistics Graduate Program. APLING 601 Introduction to Linguistics. Syllabus

A Programming Language for Mechanical Translation Victor H. Yngve, Massachusetts Institute of Technology, Cambridge, Massachusetts

Reading Competencies

Skills for Effective Business Communication: Efficiency, Collaboration, and Success

GMAT.cz GMAT.cz KET (Key English Test) Preparating Course Syllabus

COURSE OBJECTIVES SPAN 100/101 ELEMENTARY SPANISH LISTENING. SPEAKING/FUNCTIONAl KNOWLEDGE

Year 1 reading expectations (New Curriculum) Year 1 writing expectations (New Curriculum)

ONLINE ENGLISH LANGUAGE RESOURCES

Online Tutoring System For Essay Writing

CHARTES D'ANGLAIS SOMMAIRE. CHARTE NIVEAU A1 Pages 2-4. CHARTE NIVEAU A2 Pages 5-7. CHARTE NIVEAU B1 Pages CHARTE NIVEAU B2 Pages 11-14

Correlation: ELLIS. English language Learning and Instruction System. and the TOEFL. Test Of English as a Foreign Language

Historical Linguistics. Diachronic Analysis. Two Approaches to the Study of Language. Kinds of Language Change. What is Historical Linguistics?

Psychology G4470. Psychology and Neuropsychology of Language. Spring 2013.

1 Basic concepts. 1.1 What is morphology?

Selected Topics in Applied Machine Learning: An integrating view on data analysis and learning algorithms

Overview of the TACITUS Project

Comma checking in Danish Daniel Hardt Copenhagen Business School & Villanova University

A Writer s Reference, Seventh Edition Diana Hacker Nancy Sommers

Definition of terms. English tests. Writing. Guide to technical terms used in the writing mark scheme for the internally marked test

Submission guidelines for authors and editors

COURSE SYLLABUS ESU 561 ASPECTS OF THE ENGLISH LANGUAGE. Fall 2014

Hybrid Strategies. for better products and shorter time-to-market

AN INTERACTIVE ON-LINE MACHINE TRANSLATION SYSTEM (CHINESE INTO ENGLISH)

UNIVERSITÀ DEGLI STUDI DELL AQUILA CENTRO LINGUISTICO DI ATENEO

Get the most value from your surveys with text analysis

LANGUAGE! 4 th Edition, Levels A C, correlated to the South Carolina College and Career Readiness Standards, Grades 3 5

Name: Note that the TEAS 2009 score report for reading has the following subscales:

Written Language Curriculum Planning Manual 3LIT3390

Modern foreign languages

Language Meaning and Use

Ling 201 Syntax 1. Jirka Hana April 10, 2006

Development of a Dependency Treebank for Russian and its Possible Applications in NLP

Advanced Grammar in Use

Texas Success Initiative (TSI) Assessment

Czech Verbs of Communication and the Extraction of their Frames

ANNLOR: A Naïve Notation-system for Lexical Outputs Ranking

12 FIRST QUARTER. Class Assignments

A Machine Translation System Between a Pair of Closely Related Languages

English Descriptive Grammar

English Discoveries Online Alignment with Common European Framework of Reference

Domain Knowledge Extracting in a Chinese Natural Language Interface to Databases: NChiql

Building a Question Classifier for a TREC-Style Question Answering System

Extraction of Legal Definitions from a Japanese Statutory Corpus Toward Construction of a Legal Term Ontology

Writing Common Core KEY WORDS

Cambridge Primary English as a Second Language Curriculum Framework

A + dvancer College Readiness Online Alignment to Florida PERT

English Language Proficiency Standards: At A Glance February 19, 2014

CELTA. Syllabus and Assessment Guidelines. Fourth Edition. Certificate in Teaching English to Speakers of Other Languages

PTE Academic Preparation Course Outline

Course Description (MA Degree)

Course Planner for NorthStar, Second Edition, Reading and Writing, Advanced: Student Book and Writing Activity Book Six Classroom Hours

EAP Grammar Competencies Levels 1 6

English Language (first language, first year)

Student Guide for Usage of Criterion

KINDGERGARTEN. Listen to a story for a particular reason

Ask your teacher about any which you aren t sure of, especially any differences.

Comparative Analysis on the Armenian and Korean Languages

The SYSTRAN Linguistics Platform: A Software Solution to Manage Multilingual Corporate Knowledge

Towards a RB-SMT Hybrid System for Translating Patent Claims Results and Perspectives

A Comparative Analysis of Standard American English and British English. with respect to the Auxiliary Verbs

Doctoral School of Historical Sciences Dr. Székely Gábor professor Program of Assyiriology Dr. Dezső Tamás habilitate docent

SYNTACTIC PATTERNS IN ADVERTISEMENT SLOGANS Vindi Karsita and Aulia Apriana State University of Malang

Sense-Tagging Verbs in English and Chinese. Hoa Trang Dang

Customizing an English-Korean Machine Translation System for Patent Translation *

A Survey of Online Tools Used in English-Thai and Thai-English Translation by Thai Students

English Appendix 2: Vocabulary, grammar and punctuation

NAME: DATE: Leaving Certificate BUSINESS: Enterprise. Business Studies. Vocabulary, key terms working with text and writing text

Doe wat je niet laten kan: A usage-based analysis of Dutch causative constructions. Natalia Levshina

Intro to Linguistics Semantics

KSE Comp. support for the writing process 2 1

Pupil SPAG Card 1. Terminology for pupils. I Can Date Word

READING THE NEWSPAPER

Computer Aided Document Indexing System

Transcription:

COLING 82, J. Horeck~ (ed.j North-Holland Publishing Compa~y Academia, 1982 COMPUTATIONAL DATA ANALYSIS FOR SYNTAX Ludmila UhliFova - Zva Nebeska - Jan Kralik Czech Language Institute Czechoslovak Academy of Sciences Prague ~.S.S.R. Methodology and results of complex computational analysis of present-day standard Czech are presented. According to computer programmes various linguistic observations were achieved, concerning especially dependency syntax. INTRODUCTION The aim of this paper is to present the methodology and results of a detailed analysis of syntactic structures in Czech, performed with the aid of computer. This work is a part of a large and long-term project, known as Quantitative analysis of present-day standard Czech, which is carried out in the Department of mathematical linguistics at the Czech language Institute under the leadership of M. T~itelova ~1 ~, [2]. Some hitherto achieved results have already been published in special monographs /e.g. the frequency dictionaries of publicist and administrative styles [31, t4], a volume of quantitative characteristics of publicist style [5]- these prints were issued by the Czech language Institute, Prague/ or as articles and papers e.g. in PBML 31 and 32 [6] and PSML 7 and 8 tt]; other articles are to be published within a short time /as e.g. the frequency dictionary of scientific Czech/ or they have been already prepared for publication. PROPOSITIONS AND SYNTACTIC CODE The target of the syntactic analysis as well as of the whole project is to obtain a coherent set of mutually related quantitative parameters concerning the Czech language system, its functioning in different communicative spheres, as well as its stylistic values. To guarantee the reliability and representativeness of re- 391

392 L. UHLII~OV~,, I. NEBESK,g, and J. KR,~LfK sults, the research is based on the corpt;s of 540 000 word-forms /occurrences/ chosen from non-narrative /i.e. newspaper, administrative and scientific/ texts. During the preparatory works all word- -forms in 180 samples - each sample consisting of 3000 running word-forms - were supplied with a special code carrying both morphological and syntactic information about parts of speech, all main morphological categories, relevant in Czech /such as case, nominal gender and number, person, number, tense, mood, voice of verbs etc. with a very detailed subcategorization/ and all main syntactic categories /such as subject, object, predicate, attributes types of adverbials etc., again with the necessary subcategories/. The syntactic code was basically digital, with three exceptions: the characters "+" and "-" expressed whether the dependent sentence element follows or precedes the governing word /sentence member/, and the character "!" was used for special defective sentence constructions. For purposes of frequency lists, each word-form was given a basic lexical information /lemma/. The coding and lemmatization was done by Linguists in such a way, that after marking boundaries between sentences and clauses each word-form was given information about its syntactic function /"membership in sentence"/ and about its governing word in the dependency tree. Each clause was then given information about its structural position in the complex or compound sentence~ then coordination between words or between clauses was marked; finally, information about the linear arrangement of words in sentences, of clauses in complex and compound sentences, and of sentence wholes in text was added, so as to enable the complete reconstruction of running texts by computer, if necessary, at any time. The whole corpus together with the encoded information came by means of punched cards /80 columns/ over a current input programme into external computer memory on magnetic tapes /each tape Library can carry max. 30 texts, the whole corpus is contained in 7 tapes/. Each record got a special translation zone to guarantee the possibility of obligatory /non-standard/ sorting with respect to different alphabetic ordering of some Czech Letters /record size on tape is 130, block size BS=6502/. The automatic processing has been executed at the computer TESLA 200 in the computer-centers OTZ~HT and OFPL, CzechosLovak Academy of Sciences, all programmes and

COMPUTATIONAL DATA ANALYSIS FOR SYNTAX 393 their modifications were written in internal programming language "APS". The survey of the main results, given below, is ordered from the most simple to the more complicated ones, with respect to the computer programmes and with respect to the linguistic information obtained. SIMPLE COMPUTATIONAL CHARACTERISTICS The first set of programmes gives by its simple structure as to the programming technique the essential totals of frequencies /occurrences/ of encoded syntactic categories, individual syntactic features and various items under investigation. Number of results of this kind was obtained by means of reading or repeated reading of the magnetic tape library and through simple adding of different code items. The received data offer us a basic survey about the frequencies of sentence elements and simple, complex and compound sentences, about types of syntagms /both determinative and coordinative/, about the frequencies of two-element and one-element sentences and their patterns, about the frequencies of types of subordinate clauses, some word-order and clause-order characteristics and the frequency ratio of simple and complex/compound sentences. In addition, there have been collected some data concerning how often various sentence elements are expressed by a nominal or an adverbial phrase and how often they are expressed by a dependent clause, concerning the most frequent types of complex sentences, including frequencies and functions of various syntactic connectors. Using the cycles within counting programmes the distributions of syntactic units were obtained in a similarly simple and prompt way. This part of the computer work yielded the length distributions of clauses and sentences expressed in number of words, or in number of clauses. COMPOUND COMPUTATIONAL CHARACTERISTICS By doubling or chaining of testing subprogrammes and cycles another set of programmes was constructed for more complicated searching and output of syntactic characteristics. Specially commented tab-

394 L. UHLll~OV, I. NEBESK/~ and J. KR~,LIK les, as well as larger sets of numeric data supplying rich material for further steps of analysis were obtained. Whereas the results of the programme set mentioned above referred to frequencies of individual syntactic categories, the programmes reported about in this paragraph were concentrated on their relationships. Attention was paid especially to the relationships between syntactic elements and their part-of-speech appurtenance, to the syntactic relevance of some morphological categories /e.g. of noun cases/, to the correlation between sentence length and complexity of its structure, to the relationship between types of subordinate clauses and their linear position in complex sentences etc. Some of the statistical data obtained have confirmed our intuitive expectations /e.g. concerning syntactic functions of parts of speech and syntactic functions of cases of nouns/, others lead us to a deeper insight into interrelations between linguistic levels, esp. about the connection between the lexical and the syntactic levels. TYPES OF VERBAL CONTEXTS SEARCHED BY COMPUTER Using the computer operation memory we overcame the technical impossibitity of a reverse magnetic tape reading. This enabled us to prepare the third set of output we received whole nents with required code programmes with many variations. As an sentences, sentence types or their compocombinations or with immediate verbal contexts. Thus we could study not only abstract syntactic categories as such, reported above, but we also could take into account the concrete lexical manifestations of various syntactic elements, units and categories. Some interesting tendencies were found, concerning the insertion of certain lexical types into different syntactic positions, e.g. types of adjectives typical for predicative positions and other typical for attributive positions. The relationship between the semantics of the co-ordinated syntactic elements and their functions in topic-comment structure was studied, the correlation between the frequencies of subordinate clauses and lexical semantics of the governing predicate was proved /with predicates expressing the attitude of the speaker to the content of communication/, the correlation between morphological category of infinitive and semantic category of modality was found. A special attention was

COMPUTATIONAL DATA ANALYSIS FOR SYNTAX 395 paid to the syntactic structures with verbs to be and to have. MAIN RESULTS AND PERSPECTIVES The hitherto made experiences have shown that even a very extensive statistico-linguistic project can be successfully carried through, if ~here is the aid of computer. Results obtained up to now offer a very detailed picture about the functional load of syntactic elements and units in texts from various styles and from various communicative spheres of the present-day Czech. However, the importance of the project, which has not yet been completely finished, consists also in the recognition and understanding of quantitative linguistic principles, relationships, tendencies and general laws. Statistics of syntax contrasts by some of its features with the statistics of other linguistic levels. If compared with some hierarchically lower levels, such as with phonology or morphology, and their units, the sentence as the basic syntactic unit is structurally much more complex /not representing the mere summ of elements and forms of the lower levels/ and, larger, too. For these reasons it disposes with a considerably higher combination possibilities, and consequently of richer posibilities of individual usage of linguistic means of its creation and usage during the communicative process. On the other hand, if compared with concrete lexical items, most properties of sentence are of abstract, categorial nature; the inventory of sentence patterns is strictly limited in number, and therefore they are repeated very often in texts, which of course contributes to the neutralization of their st~listic value. The computer aided quantitative analysis of syntax proved to be a valuable counterpart of qualitative structural research in describing and evaluating the functioning of language means in communication.

396 L. UHLII~OV, t. NEBESK.~ and J. KRALIK REFERENCES: 11 T~itelov~, M., Ot~zky lexikalni statistiky /Academia, Praha, 1974/. [2] T~itelovQ, M., Vyu~iti statistick~ch metod v gramatice /Academia, Praha, 1980/. ~ Frekven~ni slovnik sou~asn~ ~esk~ publicistiky, T~itelov~, M. /ed./, /Ustav pro jazyk ~esk~, Praha, 1980/. ~] Frekven~ni slovnik sou~asn~ administrativy, T~itelovA, M. /ed./, /Ostav pro jazyk ~esk~, Praha, 1980/o ~] Kvantitativni charakteristiky sou~asn~ publicistiky, Linguistica II, T~ itelov~, M /ed./, /Ustav pro jazyk ~esk~, Praha, 1982/. ~] Prague Bulletin of Mathematical Linguistics vol.31 and 32 /UniVersita Karlova, Praha, 1979/. ~] Prague Studies in Mathematical Linguistics vol.7 /Academia, Praha, 1981/ and vol.8 /Academia, Praha, in print/.