Ambiguous Prepositional Phrase Resolution by Humans. Joseph Houpt




Ambiguous Prepositional Phrase Resolution by Humans
Joseph Houpt
Master of Science, Artificial Intelligence
School of Informatics, University of Edinburgh
2006

Abstract

This paper examines the information humans use to resolve ambiguous prepositional phrase attachments. The work uses a large corpus of eye-tracking data for both English and French. Multiple regression and linear mixed-effects models are used to examine the significance of various factors. Variables that have been shown to affect reading time in experimental settings, such as attachment type and the head words of the prepositional phrases, are not found to be significant in most cases. A significant interaction between attachment type and language is found. When the data are transformed, some head words also reach significance.

Acknowledgements

I am particularly thankful to my fiancée for her support, both moral and editorial. I would also like to thank my family for the support they have given me and for forgiving my lack of communication during this project, and my mother for her editorial help. Finally, I would like to thank my supervisor, Frank Keller, for the direction and input he has given me on this project.

Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified. (Joseph Houpt)

Table of Contents

1 Introduction
2 Literature Review
3 Methods
  3.1 Description of the data used
  3.2 Data processing
  3.3 Data analysis
4 Results
  4.1 Variables used
    4.1.1 Total reading time
    4.1.2 Number of words
    4.1.3 Average frequency per word
    4.1.4 Average number of characters per word
    4.1.5 Attachment type
    4.1.6 Preposition head
    4.1.7 Similarity ratio
  4.2 Variable interaction
  4.3 Modeling results
    4.3.1 French models
    4.3.2 English models
    4.3.3 English - French comparative model
5 Discussion
A Two-way Relationship Plots
B Models Including Two-Way Interaction
  B.1 English
  B.2 French

Bibliography

Chapter 1 Introduction

The sentences we hear and read are processed effortlessly. Information is automatically extracted from a sentence as it is received. The process runs from taking in the physical representation of a sentence to using the information held within it. The collection of processes that make up this transformation is known as the human sentence processing mechanism (henceforth HSPM). There are many different levels at which the processing takes place. First, there is the translation of the physical form, whether sound or written text, into a form that can be interpreted by the brain. This information is then grouped into words. The words can have a meaning, refer to something, and play different roles in the sentence. This is referred to as the lexical and semantic level of processing. The sentence also has a structure that determines the interactions between the words; this is referred to as the structural or syntactic level. Finally, each sentence is normally part of a larger context, such as a conversation or a text, which is known as the discourse level. The way these different levels interact in the process of interpreting a sentence has been the subject of much debate. Some claim that each level is interpreted entirely separately from the others, in distinct stages, with no interaction. This view treats each level as handled by separate, informationally encapsulated processes working in serial. Other theories claim that the different levels are used together to process the sentence. The most cited example of a theory based on information encapsulation is the garden path model. This model treats sentence processing in two stages. The first stage creates a structural representation of the sentence using only the grammatical categories of the words, along with general syntactic principles.
The general rules of syntax are applied first, including rules such as: a noun phrase consists of a determiner and a noun. There is some overlap in what these rules cover, so at some stages of processing there can be multiple possible structures. In this case, the HSPM is thought to make a choice based on one of two principles: late closure and minimal attachment.

Late closure: When possible, attach incoming lexical items into the clause or phrase currently being processed.

Minimal attachment: Attach incoming material into the phrase-marker being constructed using the fewest nodes consistent with the well-formedness rules of the language. (Frazier and Rayner, 1982, pg. 180)

If the HSPM reaches a point in the sentence where the structure it has chosen so far turns out not to be tenable, it will return to the point of ambiguity and try another option. This revision can be on the basis of structure, such as when there are words that are part of the sentence but no rules that would account for them. The sentence is also checked for semantic coherence at this point. If there is a structure that does not make sense, such as seeing with a fork, then the structure can be revised at this stage. Another theory holds that the HSPM uses some semantic information while building the structure from the beginning. As the sentence is parsed, there is not such a strong separation between the use of syntactic and semantic information. Instead, at a point of ambiguity the HSPM chooses a structure informed by the semantics of the words involved. There are various types of lexical information that can be used. In one theory, whether or not a phrase plays the part of an argument is an important factor in determining the structure of a sentence; this is referred to as the argumenthood of the phrase. An argument is defined as follows:

If a phrase P is an argument of a head H, P fills a role in the relation described by H, the presence of which may be implied by H. P's contribution to the meaning of the sentence is a function of that role and hence depends on the particular identity of H. (Schütze and Gibson, 1999, pg. 410)

In some sentences the structure still cannot be determined from the semantics of those words.
One example is The cop saw the robber with the binoculars. This sentence could mean either that the cop was using the binoculars or that the robber had the binoculars. In such cases, the discourse level of information is needed. Some theories claim that this information is used in making the structural decisions from the beginning rather than at later stages, for example [Crain and Steedman (1985)]. More recently, some theorists have argued that the frequencies of different structures are a factor in determining the structure of a sentence. In this case, when there are multiple choices for a structure, the more likely one is chosen. Which frequencies are used, and how finely they are calculated, is not always agreed upon. In one version of the theory, a combination of semantic and statistical information is used. For example, if the verb is an action verb, then it is more likely to have a prepositional phrase attached [MacDonald et al. (1994)]. The words themselves can also factor into the decision. For example, if the preposition is of, then the prepositional phrase is more likely to attach to the noun phrase.

This paper uses these points of ambiguity to investigate the information that is important to the HSPM. The focus is on prepositional phrase attachment ambiguity, as it is the most common type and thus has the most data available. In the corpus used, 43% of the English sentences and 34% of the French sentences contained an ambiguous prepositional phrase. Prepositional phrase ambiguity arises from the possibility of two different rules applying to the sequence V NP PP. One possibility is VP -> V NP PP, attaching the prepositional phrase directly to the verb phrase; the other is VP -> V NP with NP -> NP PP, attaching it inside the noun phrase. The former is referred to as high attachment and the latter as low attachment. In the sentence Jane ate the salad with a fork, the prepositional phrase is normally interpreted to mean that the fork was used to eat the salad. This is the high attachment structure [Fig. 1]. The alternative, low attachment structure would be interpreted as the salad being in possession of a fork when Jane ate it [Fig. 1]. A better example of a low attachment sentence would be Jane ate the salad from Tesco [Fig. 1].
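The high and low attachment structures for Jane ate the salad with a fork can be sketched as labelled bracketings; a minimal illustration in Python (the tuple encoding and helper names are mine, not from the thesis):

```python
# Represent parse trees as (label, children...) tuples; leaves are strings.
def tree(label, *children):
    return (label, *children)

def show(t):
    """Render a tree as a labelled bracketing, e.g. (VP (V ate) ...)."""
    if isinstance(t, str):
        return t
    label, *children = t
    return "(" + label + " " + " ".join(show(c) for c in children) + ")"

# High attachment: the PP "with a fork" is a sibling of V and NP inside VP.
high = tree("VP", tree("V", "ate"),
            tree("NP", "the", "salad"),
            tree("PP", "with", "a", "fork"))

# Low attachment: the PP is the last child of the object noun phrase.
low = tree("VP", tree("V", "ate"),
           tree("NP", tree("NP", "the", "salad"),
                tree("PP", "with", "a", "fork")))

print(show(high))
print(show(low))
```

Both trees cover the same word string; only the position of the PP node differs, which is exactly what the disambiguation decision chooses between.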

There are some ambiguous prepositional phrases for which the choice of structure makes no difference to the meaning. An example of this type from Hindle and Rooth (1993) is: The organization has opened a cleaning center in Seward. Here, if in Seward is attached to the verb has opened, the cleaning center is understood to be in Seward. Similarly, if in Seward is attached to the noun phrase cleaning center, the cleaning center is still understood to be in Seward. Hindle and Rooth refer to these situations as semantically indeterminate [Hindle and Rooth (1993)].

Chapter 2 Literature Review

To investigate the plausibility of the HSPM using the minimal attachment principle, Frazier and Rayner used eye-tracking. The sentences they used for the experiment include a disambiguation zone, where the HSPM is forced to choose a specific structure. They focus on the reading time for this zone, using a reading-time-per-character measure. For the sentences they tested, they found a significantly increased reading time for low attachment over high attachment [Frazier and Rayner (1982)]. One criticism was that the sentences used in Frazier and Rayner's experiments were presented in isolation. This does not account for context, which Crain and Steedman (1985) argue is used in the initial disambiguation decision. In a later study, Altmann tested the extent to which biasing contextual information affected reading times. He also used a disambiguation zone to test the reading times. When there was no context presented, Altmann verified that reading times for minimal attachment sentences were faster. However, when the context was set up to bias the reader toward a non-minimal attachment, he found that the minimal attachment reading time was slower [Altmann (1985)]. Within a corpus of natural text, this theory predicts that neither attachment type is faster than the other as long as there is a biasing context. If reading times were found to be lower in general for high attachment decisions, this could be because context did not induce a bias in a large percentage of cases. Thus, although the context biasing theory is intuitive and has been demonstrated in an experimental setting, it is difficult to verify in a natural setting. The sentences used for the experiments were devised to create a specific bias. This works well in a limited experimental situation.
However, finding evidence within a corpus of natural text is complicated by factors such as determining the biasing context for a significant number of sentences, decisions that could, in turn, be subject to disagreement. In this research, I assume it unlikely that naturally occurring text in context introduces a bias toward a certain attachment decision while another is intended. Another problem for the garden path model was that the same effects did not occur cross-linguistically. Frazier did present evidence that the minimal attachment principle is used in Dutch [Frazier (1987)]. Cuetos and Mitchell (1988) present evidence, based on both sentence completion and on-line testing, that late closure is not a linguistic universal. Zagar et al. (1997) present evidence that early closure is preferred in French as well. To adapt the garden path theory to the cross-linguistic evidence, Frazier presented construal theory [Frazier and Clifton (1997)]. As construal theory makes no distinction in the strength of attachment preference between languages, it predicts no interaction of language and attachment type in reading time. An alternative approach that Mitchell and Cuetos consider to reconcile the evidence of an early closure preference in Spanish is based on the statistics of the language [Mitchell and Cuetos (1991), as cited in Zagar et al. (1997)]. This theory is often referred to as the linguistic tuning hypothesis. It predicts that the HSPM prefers the structure that is most common in the language. Thus, if early closure is more common in French, then reading time will be faster for early closure sentences in French. Likewise, since late closure is more common in English, reading time in English is faster for late closure sentences than for early closure ones. MacDonald et al. (1994) develop the idea more thoroughly and suggest possible frequencies that are important to prepositional phrase disambiguation. They suggest that head word co-occurrence as well as prepositional head preference are important factors. Schütze (1995) reviewed many of the previous studies. He argues that these effects are more succinctly described with argument/modifier distinctions. He claims that argumenthood had not been properly controlled for, but would explain many of the results reported. A study followed in which Schütze and Gibson control for argumenthood and show that it has a significant impact on reading time [Schütze and Gibson (1999)].
Testing this theory on a large natural language corpus presents difficulties similar to those of Altmann's theory. Determining the argumenthood of a prepositional phrase must be done manually, and thus would consume a large amount of time. Furthermore, determining whether the presence of the prepositional phrase is implied by a head word in a natural setting is not always straightforward, and is therefore subject to disagreement. One general consequence of this theory would be an interaction between the head words of the noun phrase, the head of the verb phrase, and the attachment type. The rule above states that the presence of the prepositional phrase is implied depending on the head word of the phrase to which it may attach. In testing their theory, Schütze and Gibson assume that if there is a head word that clearly implies the presence of a prepositional phrase, then the reading time for that prepositional phrase will be fast in comparison with a phrase whose head word does not clearly imply it [Schütze and Gibson (1999)]. Although this implies a clear way to test the theory on natural language, it would require a large amount of data, especially if the changes in reading time were particularly small. The lexical-frequency based position holds that the argument/adjunct distinction can be

reduced to relative frequencies [MacDonald et al. (1994)]. Thus, if a prepositional phrase occurs more commonly with the head of the verb phrase than with the head of the noun phrase, high attachment is preferred, and vice versa. Schütze and Gibson (1999) compare the findings in their paper with frequency-based accounts using P(PP_head | VP_head) compared to P(PP_head | NP_head). There is a high correlation between the co-occurrence of words and their lexical similarity. Prepositional phrase attachment is also particularly problematic for machine parsers. Thus, there has been a large amount of research into what features are useful in determining the correct parse of a sentence. I assume that features that are particularly helpful to machine disambiguation are more likely to be used by the HSPM by virtue of the information they contain. Collins and Brooks (1995) report that always attaching the prepositional phrase to the noun achieves 59% accuracy. This suggests that a general syntactic rule could be the basis for a structural choice by the HSPM. However, the percentage of attachment decisions that disambiguate to low attachment is not overwhelming, so if the HSPM did use this default strategy, it would not be very efficient. Moreover, a low attachment default would run contrary to the evidence in support of a high attachment default. Using the most likely attachment for each prepositional phrase head word increases the accuracy to 79% [Collins and Brooks (1995)]. MacDonald et al. (1994) do suggest that the preposition is used by the HSPM. If the head word is used by the HSPM in determining an initial attachment decision, then an attachment type by head word interaction is expected. With just the four head words, Ratnaparkhi et al. (1994) report that humans achieve an average accuracy of 88.2%, compared with 93.2% given the whole sentence.
This suggests that a significant amount of the information for the disambiguation can be found in the head words alone, but that there is also information available in the rest of the sentence. The accuracy would presumably increase further if the sentences were given in context as well. This does not imply that these different pieces of information are used by the HSPM in the initial attachment decision, just that they are used at some point. Using a model based on the head words alone, Collins and Brooks (1995) report 84.1% accuracy for machine disambiguation. Another model, using transformation-based learning with just the head words, achieved 80.8% accuracy. Thirteen of the top 20 transformations were based on the preposition alone [Brill and Resnik (1994)]. Many of the computational approaches to PP disambiguation gain accuracy when some type of semantic information is included. A variety of methods have been used to include this information. Ratnaparkhi et al. (1994) use mutual information clustering to classify the head words. Including classes in their model increases the accuracy from 77.7% to 81.6%. Brill and Resnik (1994) classify the head words based on WordNet to reach an accuracy of

81.8%. Although this accuracy is not much higher than the 80.8% achieved without class information, far fewer transformations were required. This suggests that the information contained in the words themselves can be abstracted to information about the semantic class of the words. Budanitsky and Hirst argue that to measure lexical similarity, WordNet-based and other semantically based approaches are superior, in part because co-occurrence is not necessarily a metric [Budanitsky and Hirst (2006)]. If this is the case, it would be worthwhile to compare the predictive influence on reading time of a co-occurrence based metric and a semantic-based measure, to determine whether the bias is indeed purely frequency based or is based more on similarity. This measure would be limited to the similarity of the head words of the verb phrase and the two noun phrases, as there is no semantic corpus relating those head words to prepositions.
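The two head-word baselines discussed above, a fixed low attachment default versus the most likely attachment per preposition, can be sketched as follows. The data and resulting accuracies here are invented for illustration; they are not the 59% and 79% figures from Collins and Brooks:

```python
from collections import Counter, defaultdict

# Toy training data of (preposition, attachment) pairs -- invented.
train = [("of", "low"), ("of", "low"), ("of", "low"),
         ("with", "high"), ("with", "high"), ("with", "low"),
         ("in", "high"), ("in", "high"), ("in", "low")]

def always_low(prep):
    """Baseline 1: always attach to the noun (low attachment)."""
    return "low"

# Baseline 2: the most frequent attachment seen with each preposition.
counts = defaultdict(Counter)
for prep, att in train:
    counts[prep][att] += 1

def most_likely(prep):
    if prep in counts:
        return counts[prep].most_common(1)[0][0]
    return "low"  # fall back to low attachment for unseen prepositions

test = [("of", "low"), ("with", "high"), ("in", "high"), ("onto", "low")]
for fn in (always_low, most_likely):
    acc = sum(fn(p) == a for p, a in test) / len(test)
    print(fn.__name__, acc)
```

The per-preposition table is the simplest version of the head-word statistics that the lexical-frequency accounts attribute to the HSPM.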

Chapter 3 Methods

3.1 Description of the data used

The data used in this research are from the Dundee Eye-tracking Corpus. This corpus contains eye-tracking data for 20 subjects: 10 native French speakers and 10 native English speakers. Each subject read 800 screens of 5 lines of text per screen, for a total of 4,000 lines in their respective native language. The total word counts are 51,502 in English and 47,445 in French. The French text for the Dundee corpus is from editorials in the French newspaper Le Monde. This text is a subset of the text used for the French Treebank, allowing for cross-referencing of the data. The version of the French Treebank available included full syntactic parses for 1,081 of the 1,990 sentences in the Dundee data. The English text is from editorials published in the British newspaper The Independent. There was no syntactic information available for this text. The text was tagged using the TnT software [Brants (2000)] trained on the Wall Street Journal section of the Penn Treebank [Marcus et al. (1994)]. The sentences were then filtered based on whether they could contain a syntactically ambiguous prepositional phrase attachment. The criteria used were meant to be broad enough that all such cases would appear in the filtered data set, at the expense of letting too many sentences through. The criteria were as follows:

noun := /(NN|CD|LS|PRP$?|WP$?|DT)/
verb := /VB(D|G|N|P|Z)/
preposition := /(IN|TO)/
/.* verb .* noun .* preposition .* noun .*/

The sentence must contain some type of verb, followed by some type of noun, number, or pronoun. Then, either a word tagged IN that could be used as a preposition or TO must follow. Finally, there must be a noun, number, or pronoun after the potential preposition. The words were required to be in this order, but not necessarily adjacent. Once filtered, the sentences were manually checked for an ambiguous prepositional phrase attachment structure.
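The filtering criteria amount to an in-order scan over the POS tag sequence. A sketch under the assumption that tags follow the Penn tagset; the helper names and the exact tag classes are mine, simplified from the patterns above:

```python
import re

# Token-level tag classes mirroring the criteria in Section 3.1
# (simplified; NN\w* covers NN, NNS, NNP, NNPS).
NOUN = re.compile(r"^(NN\w*|CD|LS|PRP\$?|WP\$?|DT)$")
VERB = re.compile(r"^VB[DGNPZ]$")
PREP = re.compile(r"^(IN|TO)$")

def may_contain_ambiguous_pp(tags):
    """True if the tag sequence contains, in order (not necessarily
    adjacent): a verb, a noun, a potential preposition, a noun."""
    stages = [VERB, NOUN, PREP, NOUN]
    i = 0
    for tag in tags:
        if i < len(stages) and stages[i].match(tag):
            i += 1
    return i == len(stages)

# "Jane ate the salad with a fork" -> NNP VBD DT NN IN DT NN
print(may_contain_ambiguous_pp(["NNP", "VBD", "DT", "NN", "IN", "DT", "NN"]))
```

As in the thesis, this filter is deliberately permissive: it only flags candidates, which then need manual checking for genuine attachment ambiguity.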

Figure 3.1: High ambiguous PP attachment. Figure 3.2: Low ambiguous PP attachment.

Syntactic information pertaining to the identification of the prepositional phrase was added (verbal nucleus; noun phrase 1; prepositional phrase; noun phrase 2) to sentences of this type. The definition of a verbal nucleus is based on that used for the French Treebank and is adapted for use in English. The verbal nucleus is defined as clitics, auxiliaries, negation, and verb [Laboratoire de Linguistique Formelle (2006)]. NP1 refers to the noun phrase that is the immediate child of the verb phrase and is either the parent or the sibling of the prepositional phrase in question. The noun phrase object of the prepositional phrase is referred to as NP2. An ambiguous prepositional phrase structure was of one of two types. If the prepositional phrase was attached to the noun phrase (NP1), then there would be a verbal nucleus with NP1 as its immediate sibling to the right, and the last child of NP1 would be the prepositional phrase (see Figure 3.2). In the case of high attachment, the verbal nucleus, NP1, and the prepositional phrase would be immediate siblings, in that order (see Figure 3.1). In both cases the object of the prepositional phrase must be a noun phrase (NP2). In cases where the attachment type was not clear, similar sentences with a consistent attachment from the Penn Treebank were used to determine the structure. A best guess was used if there were no similar cases or if there were conflicting attachment types. This occurred in only around 5% of sentences. One hundred sentences were randomly selected and attachment choices were made for these by a second person to calculate the level of agreement. 82.2% were disambiguated in the same manner, resulting in a Cohen's κ of 0.68. A previous study showed 91.3% agreement when disambiguating prepositional phrase attachments in English [Ratnaparkhi et al. (1994)]. Frequency information for English was taken from the written section of the British National Corpus as harvested by Adam Kilgarriff. French frequency was based on the text of the CD-ROM du Monde Diplomatique (1987-1997) as harvested by Jean Véronis. Both frequencies were smoothed using Good-Turing smoothing for words that occurred fewer than 10 times.

3.2 Data processing

To align the eye-tracking data from the Dundee Corpus with the syntactic information from the French Treebank, all non-letter characters were removed. The remaining letters were converted to lower case. Accents were also removed from all characters, due to inconsistency in accents between the Treebank and the Dundee Corpus. The parsed sentences were then collected along with a line cross-referencing them with the Dundee Corpus. If there was no available match for a sentence, but the previous and next sentences had exactly one line between them in the other corpus, then those sentences were assumed to correspond. In each of these cases the correspondence was verified manually. Although the same process was not necessary for the English data, the tagged sentences were treated as a separate corpus and processed in the same way as the French data to maintain consistency in data format. Each sentence with syntactic information was then checked for prepositional phrase attachments that were syntactically ambiguous, using the criteria for finding ambiguities in English described in Section 3.1. The references to the Dundee Corpus were then returned for those cases of syntactic ambiguity, along with whether the attachment was high or low.
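Cohen's κ, used above to quantify annotator agreement on the attachment decisions, corrects raw agreement for agreement expected by chance. A generic two-rater sketch; the example labels are invented, not the thesis annotations:

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa for two raters labelling the same items."""
    assert len(rater1) == len(rater2)
    n = len(rater1)
    # Observed agreement: fraction of items given the same label.
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    c1, c2 = Counter(rater1), Counter(rater2)
    p_e = sum((c1[lab] / n) * (c2[lab] / n) for lab in set(c1) | set(c2))
    return (p_o - p_e) / (1 - p_e)

# Invented example: two annotators labelling attachments high/low.
r1 = ["high", "high", "low", "low", "high", "low"]
r2 = ["high", "low", "low", "low", "high", "high"]
print(round(cohens_kappa(r1, r2), 3))
```

With a two-way high/low decision, chance agreement is substantial, which is why a raw agreement of 82.2% corresponds to a noticeably lower κ of 0.68.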
The head words for each of the verbal nucleus, noun phrase 1, prepositional phrase, and noun phrase 2 were also extracted at this point. For the English sentences, the rules used to find the head words were those used by David Magerman, with the exception that possessive noun phrases were not marked up and thus not treated separately [Collins and Magerman (1995)]. The French head-finding rules used were those developed by Abhishek Arun [Arun (2004)]. In the final step of preprocessing, Pedersen et al.'s implementation [Pedersen et al. (2004)] of the Lesk similarity measure for English words was used in conjunction with WordNet version 2.1 [Patwardhan et al. (2003)]. The similarity between the head word of NP2 and each of the NP1 head and the verbal nucleus head was calculated.
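The Lesk measure scores a pair of words by the overlap between their dictionary definitions; the thesis uses Pedersen et al.'s WordNet-based implementation, but a toy gloss-overlap sketch conveys the idea. The mini-glosses and stopword list here are invented for illustration, not taken from WordNet:

```python
# Simplified Lesk-style similarity: count content words shared by two
# glosses. The real measure uses the WordNet glosses of each word's
# senses; these mini-glosses are invented.
GLOSSES = {
    "fork":  "an implement used when eating food",
    "eat":   "take food into the mouth and swallow it",
    "salad": "a dish of raw green vegetables",
}

STOPWORDS = {"a", "an", "the", "of", "and", "it", "when", "into", "used"}

def gloss_overlap(w1, w2):
    g1 = set(GLOSSES[w1].split()) - STOPWORDS
    g2 = set(GLOSSES[w2].split()) - STOPWORDS
    return len(g1 & g2)

# The thesis's ratio variable divides the verbal-nucleus/NP2 similarity
# by the NP1/NP2 similarity; here eat-fork overlaps more than salad-fork.
print(gloss_overlap("eat", "fork"), gloss_overlap("salad", "fork"))
```

In this toy setup a high VP-to-NP2 score relative to the NP1-to-NP2 score would push the ratio variable toward favouring high attachment.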

3.3 Data analysis

The variables included in the model are as follows:

Words: the total number of words in the ambiguous PP.
Frequency: the average frequency per word in the ambiguous PP.
Characters: the average number of characters per word in the ambiguous PP.
Type_low: a dummy variable to identify the attachment type of the PP.
Head_HEAD: dummy variables to identify the head word of the prepositional phrase (e.g. Head_of).
Ratio: the ratio of the VP head and NP2 head similarity to the NP1 head and NP2 head similarity.
Subject: dummy variables to identify the subject.

For this experiment, the data were analyzed using multiple regression. The general format for the model is based on method three presented in Lorch and Myers (1990). Five variables were used in the model of these data. The reading time per character, that is, the total reading time for the ambiguous prepositional phrase divided by the number of characters in the phrase, was treated as the dependent variable. The total number of words was also included in the model due to the effect it has on reading time beyond the number of characters. The per-word frequency, that is, the sum of the probabilities of the words in the prepositional phrase (based on the smoothed frequency data above) divided by the number of words in the phrase, was used. Dummy coding was used for the attachment type and subject variables. Thus the Type variable was 1 if the prepositional phrase attached to the noun phrase and 0 if it attached to the verb phrase. Nine binary variables were used for subject, so that each subject had a unique variable set to 1 with the rest 0; for the last subject, all variables were 0. A sixth variable, the VP-NP ratio, was included in the model of the English data. This variable was created by dividing the Lesk similarity between the verbal nucleus head and the NP2 head by the Lesk similarity between the NP1 head and the NP2 head.
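The dummy coding described here can be sketched as building one design-matrix row per observation; the variable ordering, function name, and example values are my own, not from the thesis:

```python
# One design-matrix row per observation. Type is 1 for low (noun)
# attachment, 0 for high; 9 subject indicators encode 10 subjects,
# with the last subject as the all-zeros baseline.
SUBJECTS = [f"s{i}" for i in range(10)]

def encode(words, characters, frequency, attachment, subject):
    row = [1.0,                 # intercept
           float(words),        # Words
           float(characters),   # Characters
           float(frequency)]    # Frequency
    row.append(1.0 if attachment == "low" else 0.0)   # Type_low dummy
    row.extend(1.0 if subject == s else 0.0 for s in SUBJECTS[:-1])
    return row

row = encode(words=3, characters=4.7, frequency=0.002,
             attachment="low", subject="s2")
print(len(row), row)
```

Interaction terms like Subject x Words would then be formed by multiplying the relevant columns, which is how the per-subject slopes in the model equations arise.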
This measure was not included in the French model due to the lack of access to a French version of WordNet, as well as the lack of evidence for the Lesk measure's accuracy in French. Pearson's r correlation was used to check for collinearity among the real-valued variables. The distributions of those variables were also checked, using histograms, to verify that they could reasonably be approximated by a normal distribution. In the case of the dummy variables, the conditional distribution of time per character was checked for normality at each level. Once the data had been fit to the linear model that minimized the sum of squared residuals, the validity of the model was checked. To test for highly influential points, Cook's D was calculated and plotted, along with a plot of leverage against the standardized residuals. The

residuals were plotted against the predicted time per character, and the spread of the residuals versus each of the predictors was checked to verify the linearity of the model and the equal-variance assumption. Interactions with the Head variables were not included, as doing so would have introduced too many variables into the model to control for the variance due to Subject.

The baseline model used is:

Time = B_s·Subject + β0 + β1·Words + β2·Characters + β3·Frequency + B_sw·Subject×Words + B_sc·Subject×Characters + B_sf·Subject×Frequency + ε_ij

The equation used to model both the French and the English data is:

Time = B_s·Subject + β0 + β1·Words + β2·Characters + β3·Frequency + B_h·Head + B_t·Type + B_sw·Subject×Words + B_sc·Subject×Characters + B_sf·Subject×Frequency + B_sh·Subject×Head + B_st·Subject×Type + ε_ij

One English model also includes terms for the similarity ratio: β4·Ratio and B_sr·Subject×Ratio.

Separate models were fit with transformed data. The following transformations were made: log10(Time) was used instead of Time; √Words was used instead of Words; and log10(Frequency) was used instead of Frequency. Also, each of the ratio variables was scaled to have zero mean and unit variance. No interactions between the within-subject predictors were included in these models. This is based on two assumptions. One is that if a predictor is not significant itself, then interactions involving that predictor are not as likely to be significant. The other is that including too many predictors can cause important predictors to show up as insignificant [Howell (1992)]. For completeness, models that include all two-way interactions for within-subject predictors are included in Appendix B. To compare the interaction between attachment type and language, a linear mixed model was used. The model was based on a model in Fox (2002). The equation is as follows, with β

representing fixed effects and b representing random effects:

Time_ij = β1 + β2·Type_ij + β3·Language_i + β4·Language_i×Type_ij + b_i1 + b_i2·Type_ij + ε_ij

For this experiment, results were declared significant at α = 0.05. All of the modeling and assumption checking was done using R. Any data for which parse data were missing from the French section were assumed to be missing at random with respect to the variables tested here. The other missing data were the Ratio values, as described later; these were only left out of the models that included the Ratio variable and of the baseline models used for evaluating those models. Outliers were determined based on the histograms and quantile-quantile plots of the variables. As no rigorous definition of an outlier was used, separate models were fit to data that included the outliers, to check for any changes in significance.
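The thesis fit the full multi-predictor models above with R's lm(). Purely to illustrate the least-squares criterion those fits minimize, here is a self-contained one-predictor reduction (Time on Words) in Python, with invented data; the closed-form `ols_fit` helper is not part of the thesis's pipeline.

```python
# Minimal least-squares sketch: regress reading time on number of words.
# One predictor only, with invented data, to show the criterion
# "minimize the sum of squared residuals" that R's lm() also uses.

def ols_fit(x, y):
    """Closed-form simple linear regression: returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

words = [2, 3, 4, 6, 8]                 # invented PP lengths
time  = [300, 420, 530, 770, 1010]      # invented reading times
b0, b1 = ols_fit(words, time)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(words, time)]
```

Because the fit includes an intercept, the residuals sum to zero, which is one of the properties the residual diagnostics described above build on.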

Chapter 4

Results

4.1 Variables used

4.1.1 Total reading time

The total reading time measure is treated as the dependent variable. The distribution of reading times in this experiment is similar to those found in other studies. In particular, it is evident from both the histograms [Fig. 4.2] and the quantile-quantile plots [Fig. 4.3] that the distribution is positively skewed. This is to be expected, because reading times cannot be less than 0. For this experiment, prepositional phrases that were not read are not included; thus, there are no reading times of 0 recorded. This was done because there were far more 0 reading times than would be expected if those times were treated as part of the general distribution of reading times. On the quantile-quantile plot for the English reading times, there are three points that look like outliers. Two of these reading times are for a particularly long prepositional phrase, 58 words long, for two different subjects. As this phrase was an outlier for the number of words, it was not included. The third is for another subject on another long prepositional phrase, 30 words long.

Figure 4.1: Total Reading Time
(a) English: Min. 52, 1st Qu. 361, Median 619, Mean 942, 3rd Qu. 1125, Max. 11605
(b) French: Min. 52, 1st Qu. 440, Median 740, Mean 1086, 3rd Qu. 1200, Max. 11738

Figure 4.2: Histogram of Reading Time per Character [(a) English, (b) French; x-axis: Total Reading Time, y-axis: Frequency]

Figure 4.3: QQ Plot of Reading Time per Character [(a) English, (b) French; Theoretical vs. Sample Quantiles]

Figure 4.4: Number of Words
(a) English: Min. 2.00, 1st Qu. 3.00, Median 4.00, Mean 6.07, 3rd Qu. 7.00, Max. 58.00
(b) French: Min. 2.00, 1st Qu. 3.00, Median 4.00, Mean 5.45, 3rd Qu. 6.00, Max. 51.00

This was not an outlier in the number of words for the phrase, and the other subjects did not have nearly as long reading times, so only the reading time for this subject on this prepositional phrase was excluded as an outlier. Two possible outliers are suggested by the quantile-quantile plot of the French reading times. These reading times are from two subjects on the same 51-word prepositional phrase. This phrase accounts for most of the longest reading time measures. As this phrase was an outlier in the number of words, it was not included in the model.

4.1.2 Number of words

The plots of phrase length show that the distributions are positively skewed [Figs. 4.5 and 4.6]. This was expected, as the minimum number of words in a phrase, by the definition used here, is two: the phrase must include at least a preposition and a word as the head of the noun phrase. The distribution of the number of words differs from that of reading time in that its mode is at the minimum. Because the number of words in a phrase is a discrete measure, the quantile-quantile plots show the data grouped in horizontal lines. For use in the linear models, the number of words is treated as normal, and therefore continuous. The plots of the French data show a gap in the distribution between 40-word phrases and 50-word phrases. The 51-word phrase mentioned earlier was the only phrase longer than 40 words and was removed from the data set. The quantile-quantile plot suggests there are phrases that are separated from the rest of the distribution; this is also evident in the histogram, as the tail is not smooth. As there are multiple phrases in this range, they were not treated as outliers. As in the French data, there is one English phrase that stands out in word length.
This phrase is the one mentioned earlier that is 58 words long. Since this is the only phrase in the English data longer than 50 words, it was considered an outlier and not included in the model as stated above.
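The word-length exclusions described above amount to a simple filter over phrase lengths; a sketch with invented lengths follows. The 50- and 40-word cutoffs mirror the case-by-case judgments in the text rather than a formal outlier rule.

```python
# Sketch of the length-based exclusions: drop the lone English phrase
# longer than 50 words and the lone French phrase longer than 40 words.
# The length lists here are invented stand-ins for the corpus data.

english_lengths = [2, 3, 4, 7, 12, 33, 58]   # 58-word phrase is the outlier
french_lengths  = [2, 3, 4, 6, 11, 38, 51]   # 51-word phrase is the outlier

english_kept = [n for n in english_lengths if n <= 50]
french_kept  = [n for n in french_lengths if n <= 40]
```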

Figure 4.5: Histogram of Number of Words [(a) English, (b) French; x-axis: Number of Words, y-axis: Frequency]

Figure 4.6: QQ Plot of Number of Words [(a) English, (b) French; Theoretical vs. Sample Quantiles]

Figure 4.7: Average Frequency per Word
(a) English: Min. 4.95e-05, 1st Qu. 2.62e-03, Median 5.08e-03, Mean 5.60e-03, 3rd Qu. 7.78e-03, Max. 1.60e-02
(b) French: Min. 4.91e-04, 1st Qu. 5.53e-03, Median 8.41e-03, Mean 1.03e-02, 3rd Qu. 1.52e-02, Max. 2.45e-02

4.1.3 Average frequency per word

The histograms of average frequency per word reveal the least smooth of the distributions of the variables used in this model [Figs. 4.8 and 4.9]. This is most likely due to these measures being dominated by the frequency of the head word of the prepositional phrase, which would explain the multi-modal look of the distributions. Furthermore, the histogram of the French frequencies has a peak at around 0.02; the phrases in this region all have the head word de or à, which are the most common head words [Fig. 4.14(b)]. The inclusion of the preposition in all of the measures, along with smoothing, also induces a minimum value for the frequency per word: the frequency of the lowest-frequency preposition combined with a number of unknown words. Despite the multi-modal nature of the distribution, it is treated as a normal distribution. The quantile-quantile plots show that the tails are also heavier than would be expected in a normal distribution, and that the distribution is positively skewed. There are no clear outliers in the histograms or the quantile-quantile plots, so no phrases were excluded based on the frequency-per-word measure.

4.1.4 Average number of characters per word

The distribution of the average number of characters per word is the closest to a normal distribution without transformation [Figs. 4.11 and 4.12]. The quantile-quantile plots reveal that the distribution is slightly positively skewed, with heavier tails. These plots also show the same horizontal grouping as the number-of-words plots. This is due to the characters-per-word variable being the quotient of two integer-valued variables, the number of characters and the number of words.
Thus, the variable is rational-valued rather than continuous. The distribution is still approximated by a normal distribution for the purposes of the linear model. There were no apparent outliers in the English data. One phrase was removed from the French data: it averaged 9.67 characters per word, while the next highest value was 8.5.
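The per-phrase predictors discussed in these sections are simple ratios; a sketch with an invented phrase follows. The smoothed word probabilities here are made up for illustration; the thesis derived them from smoothed corpus frequency data.

```python
# Sketch of the per-phrase predictors: average characters per word and
# average (smoothed) frequency per word. Phrase and probabilities invented.

phrase = ["of", "the", "government"]          # a hypothetical ambiguous PP
word_prob = {"of": 0.02, "the": 0.05, "government": 0.0004}

n_words = len(phrase)
chars_per_word = sum(len(w) for w in phrase) / n_words
freq_per_word = sum(word_prob[w] for w in phrase) / n_words
```

Because both the character count and the word count are integers, chars_per_word can only take rational values, which produces the horizontal banding seen in the quantile-quantile plots.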

Figure 4.8: Histogram of Average Frequency per Word [(a) English, (b) French; x-axis: Average Frequency per Word, y-axis: Frequency]

Figure 4.9: QQ Plot of Average Frequency per Word [(a) English, (b) French; Theoretical vs. Sample Quantiles]

Figure 4.10: Average Number of Characters per Word
(a) English: Min. 2.00, 1st Qu. 4.00, Median 4.78, Mean 4.85, 3rd Qu. 5.50, Max. 8.75
(b) French: Min. 1.67, 1st Qu. 4.00, Median 5.00, Mean 5.00, 3rd Qu. 5.69, Max. 9.67

4.1.5 Attachment type

The percentage of each attachment type is fairly similar between the two languages [Fig. 4.13]. The percentage of low attachment is a bit higher than the previously reported 59% [Collins and Brooks (1995)], which could be due to differences between British English and American English. Another possibility is that high attachments were more likely to be missed when tagging the data for the Dundee corpus. This difference was, however, assumed to be due to chance. Previous data were not available on the likelihood of high or low attachment in French, although some data have pointed toward low attachment being more likely [Gaussier and Cancedda (2001)].

4.1.6 Preposition head

In each language one preposition head is clearly the most frequently occurring in ambiguous prepositional phrases: de in French and of in English [Fig. 4.14]. It is interesting to note that these prepositions serve roughly the same purpose in their respective languages. Preposition heads are important to include in the models because, without dividing the phrases up according to their heads, a model would make predictions heavily biased toward those for the dominant preposition heads.

4.1.7 Similarity ratio

A similarity ratio could not be calculated for quite a few tagged phrases. Often this resulted from a pronoun as the head word of either NP1 or NP2: there is no entry in WordNet for pronouns, and a correct similarity measure would require the referent of the pronoun, so these phrases were not included. Infrequently used proper nouns and numbers also led to similarity ratios that could not be calculated. In the models that did not include the similarity ratio, these phrases were still included.
In the model that did include the similarity ratio, it was assumed that excluding those phrases did not affect the model's predictions of significance.

Figure 4.11: Histogram of Average Characters per Word [(a) English, (b) French; x-axis: Average Number of Characters per Word, y-axis: Frequency]

Figure 4.12: QQ Plot of Average Characters per Word [(a) English, (b) French; Theoretical vs. Sample Quantiles]

Figure 4.13: Attachment Type
(a) English: High 3337 (35%), Low 6146 (65%)
(b) French: High 1343 (39%), Low 2085 (61%)

Figure 4.14: Head Word of the Prepositional Phrase
(a) English: of 2910 (31%), in 1475 (16%), for 912 (9.6%), on 701 (7.4%), to 697 (7.3%), with 541 (5.7%), (Other) 2247 (24%)
(b) French: de 1972 (58%), à 393 (11%), dans 330 (9.6%), en 182 (5.3%), sur 120 (3.5%), pour 112 (3.3%), (Other) 319 (9.3%)

Figure 4.15: Similarity Ratio (English)
Min. 8.80e-03, 1st Qu. 4.68e-01, Median 8.40e-01, Mean 1.41e+00, 3rd Qu. 1.50e+00, Max. 5.05e+01, NAs 1.44e+03

Figure 4.16: Histogram of Similarity Ratio [English; x-axis: Similarity Ratio, y-axis: Frequency]

Figure 4.17: QQ Plot of Similarity Ratio [English; Theoretical vs. Sample Quantiles]

Figure 4.18: Correlations Between Ratio Variables
(a) English:
            Time   Words  Characters  Frequency  Ratio
Time        1.00   0.85    0.11       -0.10      0.03
Words       0.85   1.00    0.01       -0.05      0.02
Characters  0.11   0.01    1.00       -0.29      0.01
Frequency  -0.10  -0.05   -0.29        1.00      0.04
Ratio       0.03   0.02    0.01        0.04      1.00
(b) French:
            Time   Words  Characters  Frequency
Time        1.00   0.86    0.09       -0.12
Words       0.86   1.00   -0.03       -0.06
Characters  0.09  -0.03    1.00       -0.28
Frequency  -0.12  -0.06   -0.28        1.00

The similarity ratio is positively skewed for the same reason as the other variables [Fig. 4.16]: the ratio cannot be less than or equal to zero, although it can theoretically be arbitrarily close to zero. Two sentences were considered outliers. The largest similarity ratio was 50.5, for the verb tie and the noun rope versus the noun legs. The next largest was 28, for the verb be and the noun activity compared with the noun Hitler. As the next largest after that was 19.6, those two phrases were treated as outliers for the model that included similarity ratios.

4.2 Variable interaction

The correlations reported in Fig. 4.18 are inflated, since they are not broken down by subject. However, the pattern of results is generally as expected. The positive correlation between the number of words and the reading time is high. This is both intuitive (the longer a phrase, the longer it takes to read) and a well-established fact in linguistics. The other expected correlations are between average frequency per word, characters per word, and reading time. Shorter words are usually more frequent, and more frequent words are read faster; a phrase consisting of shorter words is generally read faster than a phrase with longer words if both phrases have the same number of words.
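The collinearity checks used Pearson's r; a self-contained sketch of the coefficient, checked against two invented series, follows. The `pearson_r` helper is illustrative only (the thesis computed these in R).

```python
# Pearson's r, as used for the collinearity checks, in plain Python.
import math

def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Invented data: a perfectly linear pair and an anti-correlated pair.
r_pos = pearson_r([1, 2, 3, 4], [10, 20, 30, 40])   # close to +1
r_neg = pearson_r([1, 2, 3, 4], [4, 3, 2, 1])       # close to -1
```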

Figure 4.19: French baseline model results
            Estimate   Std. Error  t value  Pr(>|t|)
Words        115.72       5.38      21.49   <2e-16
Characters    53.48      22.05       2.43   0.0153
Frequency  -3165.42    4258.08      -0.74   0.4573
Residual standard error: 486 on 3368 degrees of freedom
Multiple R-squared: 0.81, Adjusted R-squared: 0.808
F-statistic: 368 on 39 and 3368 DF, p-value: <2e-16

The box plots for reading time do not show any clear differences between high attachment and low attachment. There is a slightly lower mean for low attachment in the French data. As mentioned earlier, this effect could simply be due to the reading time for de, given the proportion of phrases with de as the head. Plots of the two-way relationships between variables are included in Appendix A.

4.3 Modeling results

In all of the results presented, the variables that are relevant to this paper are reported: Words, Characters, Frequency, Type, Head, and, when available, Ratio. Additionally, any other variables that measure within-subject variance and are significant are reported.

4.3.1 French models

The baseline model [Fig. 4.19] shows that the number of words is indeed significant, as is the number of characters per word. Frequency is not significant, most likely due to the collinearity between the frequency-per-word and characters-per-word measures: as the two are related, the variance is mostly explained without reference to the frequency per word. The full model [Fig. 4.20] again shows that the number of words is significant. Neither the number of characters per word nor the frequency per word is now significant. None of the head-word dummy variables are shown, as none were significant. The type dummy variable did not turn out to be significant. There is a small improvement over the baseline model in the multiple R² statistic; due to the extra variables, the adjusted R² is not higher. Figure 4.22 verifies that there is not a significant improvement over the baseline.
Including the outliers in the model does not change which variables are significant [Fig. 4.23]. There is an improvement in the R² measure. Using the transformed variables for the model gives similar results [Fig. 4.24]. Words is still significant. There is a higher t value for Characters and a lower value for

Figure 4.20: French full model results
            Estimate   Std. Error  t value  Pr(>|t|)
Words        114.784      5.552     20.68   <2e-16
Characters    48.386     23.339      2.07   0.0382
Frequency  -7139.795   4982.621     -1.43   0.1520
Type Low     -20.303     68.144     -0.30   0.7658
Residual standard error: 487 on 3165 degrees of freedom
Multiple R-squared: 0.821, Adjusted R-squared: 0.807
F-statistic: 59.8 on 242 and 3165 DF, p-value: <2e-16

Figure 4.21: French plot of residuals against predicted value [residuals vs. fitted values for lm(Time ~ Subj + Words + Chpw + Frqpw + Head + Type + Subj:Words + ...)]

Figure 4.22: French full model ANOVA comparison with baseline
   Res.Df  RSS       Df   Sum of Sq  F     Pr(>F)
1  3368    7.95e+08
2  3157    7.46e+08  211  4.93e+07   0.99  0.53
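The F statistic in these ANOVA comparisons can be reproduced directly from the printed table; using the Fig. 4.22 values for the French full model against the baseline:

```python
# Recomputing the nested-model F statistic from the Fig. 4.22 values:
#   F = (extra sum of squares / extra df) / (RSS_full / resid_df_full)

rss_full, df_full = 7.46e8, 3157   # full model residual SS and df
sum_sq, extra_df = 4.93e7, 211     # reduction in RSS and extra parameters

f_stat = (sum_sq / extra_df) / (rss_full / df_full)
print(round(f_stat, 2))            # 0.99, matching the table
```

An F near 1 means the extra parameters explain about as much variance per degree of freedom as noise, hence the non-significant p-value of 0.53 reported in the figure.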

Figure 4.23: French full model results including outliers
            Estimate   Std. Error  t value  Pr(>|t|)
Words        116.21       5.04      23.08   <2e-16
Characters    43.61      23.29       1.87   0.0612
Frequency  -6925.94    5060.99      -1.37   0.1713
Type Low     -23.05      69.29      -0.33   0.7394
Residual standard error: 495 on 3185 degrees of freedom
Multiple R-squared: 0.841, Adjusted R-squared: 0.829
F-statistic: 69.4 on 242 and 3185 DF, p-value: <2e-16

Figure 4.24: French transformed full model results
            Estimate   Std. Error  t value  Pr(>|t|)
Words        0.705163   0.033658    20.95   <2e-16
Characters   0.127155   0.036110     3.52   0.00044
Frequency   -0.030216   0.041230    -0.73   0.46370
Head entre  -1.084372   0.442674    -2.45   0.01436
Type Low    -0.105047   0.084329    -1.25   0.21297
Residual standard error: 0.602 on 3165 degrees of freedom
Multiple R-squared: 0.664, Adjusted R-squared: 0.638
F-statistic: 25.8 on 242 and 3165 DF, p-value: <2e-16

Frequency. The type dummy variable has a higher t value as well, although it is still nowhere near significant. Interestingly, one of the head-word dummy variables is significant. Also, the R² is much lower than in the untransformed models. An ANOVA comparison with a transformed version of the baseline model shows that including the type and head-word dummy variables results in a significant improvement [Fig. 4.25].

4.3.2 English models

For the English data, the baseline model yields results similar to those of the French data [Fig. 4.26]. Again Words and Characters are significant while Frequency is not. The intercept is included in the table, as this was the only model in which it was significant. The R² values are lower for the

Figure 4.25: French transformed model ANOVA comparison with baseline
   Res.Df  RSS   Df   Sum of Sq  F     Pr(>F)
1  3368    1238
2  3165    1146  203  91         1.24  0.013

Figure 4.26: English baseline model results
             Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)  -333.58      81.47      -4.09   4.3e-05
Words         138.87       2.75      50.48   <2e-16
Characters     69.77      13.84       5.04   4.7e-07
Frequency   -3575.03    4339.73      -0.82   0.410
Residual standard error: 462 on 9403 degrees of freedom
Multiple R-squared: 0.784, Adjusted R-squared: 0.783
F-statistic: 875 on 39 and 9403 DF, p-value: <2e-16

Figure 4.27: English model without ratio results
            Estimate    Std. Error  t value  Pr(>|t|)
Words       1.39e+02    2.80e+00    49.73    <2e-16
Characters  6.71e+01    1.48e+01     4.54    5.8e-06
Frequency  -4.06e+03    4.78e+03    -0.85    0.39614
Type Low    7.95e+00    3.95e+01     0.20    0.84047
Residual standard error: 465 on 9015 degrees of freedom
Multiple R-squared: 0.79, Adjusted R-squared: 0.78
F-statistic: 79.5 on 427 and 9015 DF, p-value: <2e-16

English, but not drastically so. For comparison with the French model, a linear model was fit to the English data without the similarity ratio. The results are shown in figures 4.27 and 4.28. The Words variable is significant, and the Characters variable stayed significant despite the extra variables. The R² results were similar in that the multiple R² increased slightly while the adjusted R² decreased slightly. Including the similarity ratio in the model resulted in some improvement in the R² [Fig. 4.29]. However, the similarity ratio did not explain a significant amount of variance, and the improvement was not significant [Fig. 4.31]. Using transformed variables resulted in the same changes as in the French data: the number of characters per word had a higher t value, the average frequency per word had a lower t value, and the R² was lower for the transformed variables. The ANOVA did not show a

Figure 4.28: English model without ratio ANOVA comparison with baseline
   Res.Df  RSS       Df   Sum of Sq  F     Pr(>F)
1  9403    2.01e+09
2  9015    1.95e+09  388  5.64e+07   0.67  1

Figure 4.29: English full model results
          Estimate    Std. Error  t value  Pr(>|t|)
Words     1.33e+02    2.91e+00    45.56    <2e-16
Chpw      6.02e+01    1.54e+01     3.91    9.22e-05
Frqpw    -3.22e+03    4.97e+03    -0.65    0.5181
Type Low  2.35e+01    4.11e+01     0.57    0.5683
Ratio     6.27e+00    9.29e+00     0.68    0.4996
Residual standard error: 442.5 on 7591 degrees of freedom
Multiple R-squared: 0.799, Adjusted R-squared: 0.788
F-statistic: 73.0 on 414 and 7591 DF, p-value: <2.2e-16

Figure 4.30: English plot of residuals against predicted value [residuals vs. fitted values for lm(Time ~ Subj + Words + Chpw + Frqpw + Head + Type + VPNPratio + Subj:...)]

Figure 4.31: English full model ANOVA comparison with baseline
   Res.Df  RSS      Df   Sum of Sq  F     Pr(>F)
1  7966    1.54e+9
2  7591    1.49e+9  375  5.52e+07   0.75  1