Ambiguous Prepositional Phrase Resolution by Humans. Joseph Houpt


Ambiguous Prepositional Phrase Resolution by Humans

Joseph Houpt

Master of Science
Artificial Intelligence
School of Informatics
University of Edinburgh
2006

Abstract

This paper examines the information humans use to resolve ambiguous prepositional phrase attachments. The work is based on a large corpus of eye-tracking data for both English and French. Multiple regression and linear mixed effects models are used to examine the significance of various factors. Variables that have been shown to affect reading time in experimental settings, such as attachment type and the head words of the prepositional phrases, are not found to be significant in most cases. A significant interaction between attachment type and language is found. When the data were transformed, some head words were also significant.

Acknowledgements

I am particularly thankful to my fiancée for her support, both mental and editorial. I would also like to thank my family for the support they have given me and for forgiving my lack of communication during this project, and my mother for her editorial help. Finally, I would like to thank my supervisor, Frank Keller, for the direction and input he has given me on this project.

Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Joseph Houpt)

Table of Contents

1 Introduction 1
2 Literature Review 5
3 Methods
   3.1 Description of the data used
   3.2 Data processing
   3.3 Data analysis
4 Results
   4.1 Variables used
       Total reading time
       Number of words
       Average frequency per word
       Average number of characters per word
       Attachment type
       Preposition head
       Similarity ratio
       Variable interaction
   4.2 Modeling results
       French models
       English models
       English - French comparative model
5 Discussion 37
A Two-way Relationship Plots 41
B Models Including Two-Way Interaction 48
   B.1 English
   B.2 French
Bibliography 55

Chapter 1

Introduction

The sentences we hear and read are processed effortlessly. The information is automatically extracted from the sentence as it is input. The process runs from taking in the physical representation of a sentence to using the information held within it. The collection of processes that make up this transformation is known as the human sentence processing mechanism (henceforth HSPM).

There are many different levels at which the processing takes place. First, there is the translation of the physical form, whether sound or written text, into a form that can be interpreted by the brain. This information is then grouped into words. The words can have a meaning, refer to something, and play different roles in the sentence. This level is referred to as the lexical and semantic level of processing. The sentence also has a structure that determines the interactions between the words, referred to as the structural or syntactic level. Finally, each sentence is normally part of a larger context, such as a conversation or a text, which is known as the discourse level.

The way these different levels interact in the process of interpreting a sentence has been the subject of much debate. Some claim that each level is interpreted entirely separately from the others in different stages, with no interaction. This view considers each level to be processed by separate, informationally encapsulated processes that work in series. Other theories claim that the different levels are used together to process the sentence.

The most cited example of a theory based on information encapsulation is the garden path model. This model treats sentence processing in two stages. The first stage creates a structural representation of the sentence using only the grammatical categories of the words, along with general syntactic principles.
The general rules of syntax are applied first; these include rules such as: a noun phrase consists of a determiner and a noun. There is some overlap in what these rules cover, so at some stages of processing there can be multiple possible structures. In this case, the HSPM is thought to choose based on one of two principles: late

closure and minimal attachment.

Late closure: When possible, attach incoming lexical items into the clause or phrase currently being processed.

Minimal attachment: Attach incoming material into the phrase-marker being constructed using the fewest nodes consistent with the well-formedness rules of the language. (Frazier and Rayner, 1982, pg. 180)

If the HSPM reaches a point in the sentence where the structure it has chosen so far turns out not to be tenable, it returns to the point of ambiguity and tries another option. This revision can be made on structural grounds, such as when there are words in the sentence that no rule accounts for. The sentence is also checked for semantic coherence at this point. If a structure does not make sense, such as seeing with a fork, it can be revised at this stage.

Another theory holds that the HSPM uses some semantic information while it is building the structure from the beginning. As the sentence is parsed, there is no strong separation between the use of syntactic and semantic information. Instead, at a point of ambiguity the HSPM chooses a structure informed by the semantics of the words involved. Various types of lexical information can be used. In one theory, whether or not a phrase plays the part of an argument, referred to as the argumenthood of the phrase, is an important factor in determining the structure of a sentence. An argument is defined as follows:

If a phrase P is an argument of a head H, P fills a role in the relation described by H, the presence of which may be implied by H. P's contribution to the meaning of the sentence is a function of that role and hence depends on the particular identity of H. (Schütze and Gibson, 1999, pg. 410)

In some sentences the structure still cannot be determined from the semantics of those words.
One example is The cop saw the robber with the binoculars. This sentence could mean either that the cop was using the binoculars or that the robber had binoculars. In these cases, the discourse level of information is needed. Some theories claim that this information is used in making structural decisions from the beginning rather than at later stages, for example [Crain and Steedman (1985)]. More recently, some theorists have argued that the frequencies of different structures are a factor in determining the structure of a sentence: when there are multiple choices for a structure, the more likely one is chosen. Which frequencies are used, and how finely they are calculated, is not always agreed upon. In one version of the theory, a combination of semantic and statistical information is used. For example, if the verb is an action verb then it is more likely to have a prepositional phrase attached [MacDonald et al. (1994)]. The words themselves can also factor into the decision. For example, if the preposition is of, then the prepositional phrase is more likely to attach to the noun phrase.

This paper uses these points of ambiguity to investigate the information that is important to the HSPM. The focus is on prepositional phrase attachment ambiguity, as it is the most common type and thus has the most data available. In the corpus used, 43% of the English sentences and 34% of the French sentences contained an ambiguous prepositional phrase. Prepositional phrase ambiguity arises from the possibility of two different rules that could apply to the sequence V NP PP. One possibility is VP -> V NP PP, attaching the prepositional phrase to the verb phrase; the other is VP -> V NP with NP -> NP PP, attaching it to the noun phrase. The former is referred to as high attachment and the latter as low attachment. In the sentence Jane ate the salad with a fork, the prepositional phrase is normally interpreted to mean that the fork was used to eat the salad. This is the high attachment structure [Fig. 1]. The alternative, low attachment structure would be interpreted as the salad being in possession of a fork when Jane ate it [Fig. 1]. A better example of a low attachment sentence would be Jane ate the salad from Tesco [Fig. 1].
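The two competing structures can be made concrete as labelled bracketings. The sketch below is illustrative only; the bracket helper is a hypothetical convenience, not anything used in the thesis.

```python
# Minimal sketch of the two bracketed structures for the ambiguous PP in
# "Jane ate the salad with a fork". The bracket() helper is hypothetical.

def bracket(label, *children):
    """Render a parse node as a labelled bracket string."""
    return "[" + label + " " + " ".join(children) + "]"

pp = bracket("PP", "with a fork")
np_obj = bracket("NP", "the salad")

# High attachment: VP -> V NP PP (the fork is the instrument of eating)
high = bracket("VP", bracket("V", "ate"), np_obj, pp)

# Low attachment: VP -> V NP, NP -> NP PP (the salad comes with a fork)
low = bracket("VP", bracket("V", "ate"), bracket("NP", np_obj, pp))

print(high)  # [VP [V ate] [NP the salad] [PP with a fork]]
print(low)   # [VP [V ate] [NP [NP the salad] [PP with a fork]]]
```

The only difference between the two strings is whether the PP node sits inside the object NP or directly under the VP, which is exactly the ambiguity at issue.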

There are some ambiguous prepositional phrases for which the choice of structure does not make a difference to the meaning. An example of this type from Hindle and Rooth (1993) is: The organization has opened a cleaning center in Seward. Here, if in Seward is attached to the verb has opened, the cleaning center is understood to be in Seward. Similarly, if in Seward is attached to the noun phrase cleaning center, the cleaning center is still understood to be in Seward. Hindle and Rooth refer to these situations as semantically indeterminate [Hindle and Rooth (1993)].

Chapter 2

Literature Review

To investigate the plausibility of the HSPM using the minimal attachment principle, Frazier and Rayner used eye-tracking. The sentences they used for the experiment include a disambiguation zone, where the HSPM is forced to choose a specific structure. They focus on the reading time for this zone, using a reading time per character measure. For the sentences they tested, they found a significantly increased reading time for low attachment over high attachment [Frazier and Rayner (1982)].

One criticism was that the sentences used in Frazier and Rayner's experiments were presented in isolation. This does not account for context, which Crain and Steedman (1985) argue is used in the initial disambiguation decision. In a later study, Altmann tested the extent to which biasing contextual information affected reading times. He also used a disambiguation zone to test reading times. When no context was presented, Altmann verified that reading times for minimal attachment sentences were faster. However, when the context was set up to bias the reader toward a non-minimal attachment, he found that the minimal attachment reading time was slower [Altmann (1985)].

Within a corpus of natural text, this theory predicts that one attachment type is not any faster than the other as long as there is a biasing context. If reading times were found to be lower in general for high attachment decisions, this could be because context did not induce bias in a large percentage of cases. Thus, although the context biasing theory is intuitive and has been demonstrated in an experimental setting, it is difficult to verify in a natural setting. The sentences used for the experiments were devised to create a specific bias. This works well in a limited experimental situation.
However, finding evidence within a corpus of natural text is complicated by factors such as determining the biasing context for a significant number of sentences, decisions that could, in turn, be subject to disagreement. In this research, I assume it unlikely that naturally occurring text in context introduces a bias toward one attachment decision while another is intended.

Another problem for the garden path model was that the same effects did not occur cross-linguistically. Frazier did present evidence that the minimal attachment principle is used in Dutch [Frazier (1987)]. Cuetos and Mitchell (1988) present evidence, based on both sentence completion and on-line testing, that late closure is not a linguistic universal. Zagar et al. (1997) present evidence that early closure is preferred in French as well. To adapt the garden path theory to the cross-linguistic evidence, Frazier presented construal theory [Frazier and Clifton Jr. (1997)]. As construal theory does not make any distinction in the strength of attachment preference between languages, it predicts no interaction of language and attachment type in reading time.

An alternative approach that Mitchell and Cuetos consider to reconcile the evidence of an early closure preference in Spanish is based on the statistics of the language [Mitchell and Cuetos (1991), as cited in Zagar et al. (1997)]. This theory is often referred to as the linguistic tuning hypothesis. It predicts that the HSPM prefers the structure that is most common in the language. Thus, if early closure is more common in French, then reading time will be faster for early closure sentences in French. Likewise, since late closure is more common in English, reading time is faster for late closure sentences in English. MacDonald et al. (1994) develop the idea more thoroughly and suggest possible frequencies that are important to prepositional phrase disambiguation. They suggest that head word co-occurrence as well as prepositional head preference are important factors.

Schütze (1995) reviewed many of the previous studies. He argues that these effects are more succinctly described with argument/modifier distinctions. He claims that argumenthood had not been properly controlled for, but would explain many of the results reported. A study followed in which Schütze and Gibson control for argumenthood and show that it has a significant impact on reading time [Schütze and Gibson (1999)].
Testing this theory on a large natural language corpus presents difficulties similar to those of Altmann's theory. Determining the argumenthood of a prepositional phrase must be done manually, and thus would consume a large amount of time. Furthermore, determining whether the presence of the prepositional phrase is implied by a head word in a natural setting is not always straightforward and is therefore subject to disagreement. One general consequence of this theory would be an interaction between the head word of the noun phrase, the head of the verb phrase, and the attachment type: the rule above states that the presence of the prepositional phrase is implied depending on the head word of the phrase to which it may attach. In testing their theory, Schütze and Gibson assume that if there is a head word that clearly implies the presence of a prepositional phrase, then the reading time for that prepositional phrase will be fast in comparison with a phrase whose head word does not clearly imply it [Schütze and Gibson (1999)]. Although this implies a clear way to test the theory on natural language, it would require a lot of data, especially if the changes in reading time were particularly small. The lexical-frequency based position holds that the argument/adjunct distinction can be

reduced to relative frequencies [MacDonald et al. (1994)]. Thus, if a prepositional phrase more commonly occurs with the head of the verb phrase than with the head of the noun phrase, high attachment is preferred, and vice versa. Schütze and Gibson (1999) compare the findings in their paper with frequency-based accounts using P(PPhead | VPhead) compared to P(PPhead | NPhead). There is a high correlation between the co-occurrence of the words and their lexical similarity.

Prepositional phrase attachment is also particularly problematic for machine parsers. Thus, there has been a large amount of research into which features are useful in determining the correct parse for a sentence. I assume that features that are particularly helpful to machine disambiguation are more likely to be used by the HSPM, by virtue of the information they contain. Collins and Brooks (1995) report that attaching the prepositional phrase to the noun in every case achieves 59% accuracy. This suggests that a general syntactic rule could be the basis for a structural choice by the HSPM. However, the percentage of attachment decisions that disambiguate to low attachment is not overwhelming, so if the HSPM did use this default strategy, it would not be very efficient. Moreover, a low attachment default would run contrary to the evidence in support of a high attachment default. Using the most likely attachment for each prepositional phrase head word increases the accuracy to 79% [Collins and Brooks (1995)]. MacDonald et al. (1994) do suggest that the preposition is used by the HSPM. If the head word is used by the HSPM in determining an initial attachment decision, then an attachment type by head word interaction is expected. Given just the four head words, Ratnaparkhi et al. (1994) report that humans have an average accuracy of 88.2%, compared with 93.2% given the whole sentence.
This suggests that a significant amount of information for the disambiguation can be found in the head words alone, but that there is also information available in the rest of the sentence. The accuracy would presumably increase further if the sentences were given in context. This does not imply that these different pieces of information are used by the HSPM in the initial attachment decision, just that they are used at some point. Using a model based on the head words alone, Collins and Brooks (1995) report 84.1% accuracy for machine disambiguation. Another model, using transformation-based learning with just the head words, achieved 80.8% accuracy; thirteen of the top 20 transformations were based on the preposition alone [Brill and Resnik (1994)].

Many of the computational approaches to PP disambiguation gain accuracy when some type of semantic information is included. A variety of methods have been used to include this information. Ratnaparkhi et al. (1994) use mutual information clustering to classify the head words; including classes in their model increases the accuracy from 77.7% to 81.6%. Brill and Resnik (1994) classify the head words based on WordNet to reach an accuracy of

81.8%. Although this is not much higher than the 80.8% accuracy without class information, far fewer transformations were required. This suggests that the information contained in the words themselves can be abstracted to information about their semantic class. Budanitsky and Hirst argue that WordNet-based and other semantically based approaches are superior for measuring lexical similarity, in part because co-occurrence is not necessarily a metric [Budanitsky and Hirst (2006)]. If this is the case, it would be worthwhile to compare the predictive influence of a co-occurrence based metric and a semantic-based measure on reading time, to determine whether the bias is indeed purely frequency based or is based more on similarity. Such a measure would be limited to the similarity of the head words of the verb phrase and the two noun phrases, as there is no semantic corpus relating those head words to prepositions.
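The preposition-only baseline discussed in this chapter, in the spirit of Collins and Brooks (1995), can be sketched in a few lines: for each preposition, predict whichever attachment it took most often in training data. The tuples below are toy examples invented for illustration, not data from any of the cited studies.

```python
from collections import Counter, defaultdict

# Toy training tuples: (verb, noun1, preposition, noun2, attachment).
# These are invented examples, not corpus data.
train = [
    ("ate", "salad", "with", "fork", "high"),
    ("saw", "robber", "with", "binoculars", "high"),
    ("bought", "shirt", "with", "stripes", "low"),
    ("opened", "center", "in", "Seward", "high"),
    ("ate", "salad", "from", "Tesco", "low"),
]

# Count attachment outcomes per preposition.
by_prep = defaultdict(Counter)
for verb, noun1, prep, noun2, label in train:
    by_prep[prep][label] += 1

def predict(prep, default="low"):
    """Most frequent attachment seen for this preposition; default otherwise."""
    counts = by_prep.get(prep)
    return counts.most_common(1)[0][0] if counts else default

print(predict("with"))  # "high" (2 of the 3 training cases)
print(predict("of"))    # unseen preposition, falls back to the default
```

The fallback to a fixed default mirrors the 59% always-attach-low baseline; conditioning on the preposition is what lifts such a model toward the 79% figure reported above.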

Chapter 3

Methods

3.1 Description of the data used

The data used in this research are from the Dundee Eye-tracking Corpus. This corpus contains eye-tracking data for 20 subjects: 10 native French speakers and 10 native English speakers. Each subject read 800 screens of 5 lines of text per screen, for a total of 4,000 lines in their respective native language. The total word counts are 51,502 in English and 47,445 in French.

The French text for the Dundee corpus is from editorials in the French newspaper Le Monde. This text is a subset of the text used for the French Treebank, allowing for cross-referencing of the data. The version of the French Treebank available included full syntactic parses for 1,081 of the 1,990 sentences in the Dundee data. The English text is from editorials published in the British newspaper The Independent. There was no syntactic information available for this text. The text was tagged using the TnT software [Brants (2000)] trained on the Wall Street Journal section of the Penn Treebank [Marcus et al. (1994)].

The sentences were then filtered based on whether they could contain a syntactically ambiguous prepositional phrase attachment. The criteria were meant to be broad enough that all cases would appear in the filtered data set, at the expense of letting too many sentences through. The criteria were as follows:

    noun := /(NN|CD|LS|PRP$?|WP$?|DT)/
    verb := /VB(D|G|N|P|Z)/
    preposition := /(IN|TO)/
    pattern := /.* verb .* noun .* preposition .* noun .*/

The sentence must contain some type of verb, followed by some type of noun, number, or pronoun. Then, either a word tagged as IN that could be used as a preposition, or TO, must follow. Finally, there must be a noun, number, or pronoun after the potential preposition. The words were required to be in this order, but not necessarily adjacent. Once filtered, the sentences were manually checked for an ambiguous prepositional phrase attachment structure.
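The filter above can be sketched as a subsequence check over a sentence's POS tag sequence. This is a minimal reconstruction under the assumption of Penn Treebank-style tags; the helper name and the exact regular expressions are mine, pieced together from the criteria as listed.

```python
import re

# Hedged reconstruction of the Section 3.1 filter: a verb tag, later a
# noun-like tag, later a preposition tag, later another noun-like tag,
# in order but not necessarily adjacent.
NOUN = re.compile(r"NN\S*|CD|LS|PRP\$?|WP\$?|DT")
VERB = re.compile(r"VB[DGNPZ]")
PREP = re.compile(r"IN|TO")

def is_candidate(tags):
    """tags: list of POS tags for one sentence. Greedy subsequence match."""
    needed = [VERB, NOUN, PREP, NOUN]
    i = 0
    for tag in tags:
        if i < len(needed) and needed[i].fullmatch(tag):
            i += 1
    return i == len(needed)

# "Jane ate the salad with a fork" tagged (toy example):
print(is_candidate(["NNP", "VBD", "DT", "NN", "IN", "DT", "NN"]))  # True
print(is_candidate(["NNP", "VBD", "RB"]))                          # False
```

Because a greedy left-to-right scan always finds a subsequence if one exists, this check is equivalent to the `/.* verb .* noun .* preposition .* noun .*/` pattern over the tag string.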

Figure 3.1: High ambiguous PP attachment
Figure 3.2: Low ambiguous PP attachment

Syntactic information pertaining to the identification of the prepositional phrase (verbal nucleus; noun phrase 1; prepositional phrase; noun phrase 2) was added to sentences of this type. The definition of a verbal nucleus is based on that used for the French Treebank and is adapted for use in English: the verbal nucleus is defined as clitics, auxiliaries, negation, and verb [Laboratoire de Linguistique Formelle (2006)]. NP1 refers to the noun phrase that is the immediate child of the verb phrase and is either the parent or the sibling of the prepositional phrase in question. The noun phrase object of the prepositional phrase is referred to as NP2.

An ambiguous prepositional phrase structure was of one of two types. If the prepositional phrase was attached to the noun phrase (NP1), then there would be a verbal nucleus with NP1 as its immediate sibling to the right, and the last child of NP1 would be the prepositional phrase (see Figure 3.2). In the case of high attachment, the verbal nucleus, noun phrase 1, and the prepositional phrase would be immediate siblings, in that order (see Figure 3.1). In both cases the object of the prepositional phrase must be a noun phrase (NP2).

In cases where the attachment type was not clear, similar sentences with a consistent attachment from the Penn Treebank were used to determine the structure. A best guess was used if there were no similar situations or if there were conflicting attachment types. This only occurred for around 5% of sentences. One hundred sentences were randomly selected and attachment choices were made for these by a second person to calculate the level of agreement. 82.2% were disambiguated in the same manner, resulting in a Cohen's κ of 0.68. A previous study showed 91.3% agreement when disambiguating prepositional phrase attachments in English [Ratnaparkhi et al. (1994)].

Frequency information for English was taken from the written section of the British National Corpus as harvested by Adam Kilgarriff. French frequency was based on the text of the CD-ROM du Monde Diplomatique as harvested by Jean Véronis. Both frequencies were smoothed using Good-Turing smoothing for words that occurred fewer than 10 times.

3.2 Data processing

To align the eye-tracking data from the Dundee Corpus with the syntactic information from the French Treebank, all non-letter characters were removed. The remaining letters were converted to lower case. Accents were also removed from all characters, due to inconsistency in accents between the Treebank and the Dundee Corpus. The parsed sentences were then collected along with a line cross-referencing them with the Dundee Corpus. If there was no available match for a sentence, but the previous and next sentences had exactly one line between them in the other corpus, then those sentences were assumed to correspond. In each of these cases the correspondence was verified manually. Although the same process was not necessary for the English data, the tagged sentences were treated as a separate corpus and processed the same way as the French data, to maintain consistency in data format.

Each sentence with syntactic information was then checked for prepositional phrase attachments that were syntactically ambiguous, using the criteria for finding ambiguities in English described in Section 3.1. The references to the Dundee Corpus were then returned for those cases of syntactic ambiguity, along with whether the attachment was high or low.
The head words of each of the verbal nucleus, noun phrase 1, prepositional phrase, and noun phrase 2 were also extracted at this point. For the English sentences, the head-finding rules were those used by David Magerman, with the exception that possessive noun phrases were not marked up and thus not treated separately [Collins and Magerman (1995)]. The French head-finding rules were those developed by Abhishek Arun [Arun (2004)]. In the final step of preprocessing, Pedersen et al.'s implementation [Pedersen et al. (2004)] of the Lesk similarity measure for English words was used in conjunction with WordNet version 2.1 [Patwardhan et al. (2003)]. The similarity between the head word of NP2 and each of the NP1 head and the verbal nucleus head was calculated.
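The Lesk measure scores two senses by the overlap of their dictionary glosses. A bare-bones sketch of the idea, and of the similarity ratio built from it, is below. The glosses are invented stand-ins; the actual work used Pedersen et al.'s WordNet-based implementation, which extends glosses with related synsets and is far more robust than this toy.

```python
# Toy gloss-overlap sketch of the Lesk-style similarity ratio. The glosses
# below are hypothetical definitions for illustration only.

def lesk_overlap(gloss_a, gloss_b):
    """Number of word types shared by two definition glosses."""
    return len(set(gloss_a.split()) & set(gloss_b.split()))

glosses = {
    "eat":   "take in solid food through the mouth",
    "salad": "food mixture of raw vegetables",
    "fork":  "utensil used to lift food to the mouth",
}

vp_np2 = lesk_overlap(glosses["eat"], glosses["fork"])     # VP head vs NP2 head
np1_np2 = lesk_overlap(glosses["salad"], glosses["fork"])  # NP1 head vs NP2 head

# Ratio > 1 suggests NP2 is semantically closer to the verb than to NP1.
ratio = vp_np2 / np1_np2
print(vp_np2, np1_np2, ratio)  # 3 1 3.0
```

For "eat" and "fork" the glosses share "food", "the", and "mouth", while "salad" and "fork" share only "food", so the ratio leans toward verb attachment, which matches the instrumental reading of with a fork.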

3.3 Data analysis

The variables included in the model are as follows:

Words: the total number of words in the ambiguous PP.
Frequency: the average frequency per word in the ambiguous PP.
Characters: the average number of characters per word in the ambiguous PP.
Type_low: a dummy variable identifying the attachment type of the PP.
Head_HEAD: dummy variables identifying the head word of the prepositional phrase (e.g. Head_of).
Ratio: the ratio of the VP head and NP2 head similarity to the NP1 head and NP2 head similarity.
Subject: dummy variables identifying the subject.

For this experiment, the data were analyzed using multiple regression. The general format for the model is based on method three presented in Lorch and Myers (1990). There were five variables used for the model of these data. The reading time per character, which is the total reading time for the ambiguous prepositional phrase divided by the number of characters in the prepositional phrase, was treated as the dependent variable. The total number of words was also included in the model, due to the effect it has on reading time beyond the number of characters. The per-word frequency, that is, the sum of the probabilities of the words in the prepositional phrase, based on the smoothed frequency data above, divided by the number of words in the prepositional phrase, was also used.

Dummy encoding was used for the attachment type and the subject variables. Thus the type variable was 1 if the prepositional phrase attached to the noun phrase and 0 if it attached to the verb phrase. Nine binary variables were used for subject, so that each of the first nine subjects had a unique variable set to 1 with the rest 0; for the last subject all nine variables were 0. A sixth variable, the VP-NP ratio, was included in the model of the English data. This variable was created by dividing the Lesk similarity between the verbal nucleus head and the NP2 head by the Lesk similarity between the NP1 head and the NP2 head.
This measure was not included in the French model, due to the lack of access to a French version of WordNet as well as the lack of evidence for the Lesk measure's accuracy in French. Pearson's r was used to check for collinearity among the real-valued variables. The distribution of each of those variables was also checked, using a histogram, to verify that it could reasonably be approximated by a normal distribution. In the case of the dummy variables, the conditional distribution of time per character was checked for normality at each level. Once the data had been fit to the linear model minimizing the squared residuals, the validity of the model was checked. To test for highly influential points, Cook's D was calculated and plotted, along with a plot of leverage against the standardized residuals. The

residuals were plotted against the predicted time per character, and the spread of the residuals against each of the predictors was checked to verify the linearity of the model and the equality of variance assumption. Interactions with the Head variables were not included, as they would introduce too many variables into the model to control for the variance due to Subject.

The baseline model used is:

    Time = B_s*Subject + β_0 + β_1*Words + β_2*Characters + β_3*Frequency
           + B_sw*Subject*Words + B_sc*Subject*Characters
           + B_sf*Subject*Frequency + ε_ij

The equation used to model both the French and the English data is:

    Time = B_s*Subject + β_0 + β_1*Words + β_2*Characters + β_3*Frequency
           + B_h*Head + B_t*Type
           + B_sw*Subject*Words + B_sc*Subject*Characters
           + B_sf*Subject*Frequency + B_sh*Subject*Head
           + B_st*Subject*Type + ε_ij

One English model also includes terms for the similarity ratio: β_4*Ratio and B_sr*Subject*Ratio.

Separate models were fit with transformed data. The following transformations were made: log10(Time) was used instead of Time; √Words was used instead of Words; and log10(Frequency) was used instead of Frequency. Also, each of the ratio variables was scaled to have zero mean and unit variance.

No interactions between the within-subject predictors were included in these models. This is based on two assumptions. One is that if a predictor is not significant itself, then interactions involving that predictor are less likely to be significant. The other is that including too many predictors can cause important predictors to show up as insignificant [Howell (1992)]. For completeness, models that include all two-way interactions for within-subject predictors are included in Appendix B.

To compare the interaction between attachment type and language, a linear mixed model was used. The model was based on a model in Fox (2002). The equation is as follows, with β

representing fixed effects and b representing random effects:

    Time_ij = β_1 + β_2*Type_ij + β_3*Language_i + β_4*Language_i*Type_ij
              + b_i1 + b_i2*Type_ij + ε_ij

For this experiment, results were declared as significant at α = . All of the modeling and assumption checking was done using R. Any data for which parse data were missing from the French section were assumed to be missing at random with respect to the variables tested here. The other missing data were the Ratio information, as described later; these were only left out of models that included the Ratio variable and of the baseline models used for evaluating those models. Outliers were determined based on the histogram and quantile-quantile plots of the variables. As no rigorous definition was used, separate models were also fit to the data including the outliers, to check for any changes in significance.
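The dummy-coded regression described in Section 3.3 can be sketched with ordinary least squares. This is an illustrative reconstruction only: the thesis fit its models in R, and the data below are random stand-ins, with one subject left out as the baseline exactly as in the coding scheme above. The per-subject interaction terms are omitted here to keep the sketch short.

```python
import numpy as np

# Sketch of the dummy-coded multiple regression (simulated data, not the
# Dundee measurements; subject interaction terms omitted for brevity).
rng = np.random.default_rng(0)
n, n_subjects = 120, 10

words = rng.integers(2, 12, n).astype(float)   # words in the ambiguous PP
chars = rng.normal(5.0, 1.0, n)                # avg characters per word
freq = rng.normal(-4.0, 1.0, n)                # e.g. log10 per-word frequency
subject = rng.integers(0, n_subjects, n)

# Dummy-code subjects: 9 indicator columns, the last subject is baseline.
dummies = np.zeros((n, n_subjects - 1))
for j in range(n_subjects - 1):
    dummies[:, j] = (subject == j)

# Simulated time per character with a known word-count effect of 0.5.
time = 3.0 + 0.5 * words + 0.1 * chars - 0.2 * freq + rng.normal(0, 0.1, n)

X = np.column_stack([np.ones(n), words, chars, freq, dummies])
beta, *_ = np.linalg.lstsq(X, time, rcond=None)
print(beta[1])  # estimated Words coefficient, close to the true 0.5
```

The subject dummies absorb each reader's baseline speed, which is the point of Lorch and Myers' method three: the remaining coefficients estimate within-subject effects of the item-level predictors.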

Chapter 4

Results

4.1 Variables used

Total reading time

The total reading time measure is treated as the dependent variable. The distribution of the reading times in this experiment is similar to those found in other studies. In particular, it is evident from both the histograms [Fig. 4.2] and the quantile-quantile plots [Fig. 4.3] that the distribution is positively skewed. This is to be expected because reading times cannot be less than 0. For this experiment, prepositional phrases that were not read are not included, so there are no reading times of 0 recorded. This was done because there were far more 0 reading times than would be expected if these times were treated as part of the general distribution of reading times.

On the quantile-quantile plot for the English reading times, there are three points that look like outliers. Two of these reading times are for a particularly long prepositional phrase, 58 words long, for two different subjects. As this phrase was an outlier for the number of words, it was not included. The third is for another subject on another long prepositional phrase, 30

Figure 4.1: Total Reading Time. (a) English: Min. 52, 1st Qu. 361, Median 619, Mean 942. (b) French: Min. 52, 1st Qu. 440, Median 740.
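The positive skew described here is what motivates the log10-transformed models of Section 3.3. A minimal simulation (invented values, not Dundee data) shows how a log transform pulls in the long right tail typical of reading times:

```python
import numpy as np

# Simulated positively skewed "reading times" in ms (not Dundee data).
rng = np.random.default_rng(1)
times = rng.lognormal(mean=6.5, sigma=0.7, size=2000)

def skewness(x):
    """Sample skewness: mean of the cubed standardized values."""
    z = (x - x.mean()) / x.std()
    return (z ** 3).mean()

raw_skew = skewness(times)
log_skew = skewness(np.log10(times))
print(raw_skew, log_skew)  # raw skew is large and positive; log skew is near 0
```

Because a lognormal variable is exactly normal after a log transform, the transformed distribution better satisfies the normality checks applied to the model variables.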

[Figure 4.2: Histogram of Reading Time per Character; (a) English, (b) French.]

[Figure 4.3: QQ Plot of Reading Time per Character; (a) English, (b) French.]

[Figure 4.4: Number of Words; summary statistics for (a) English (Median 4.00) and (b) French (Median 4.00); remaining values not recoverable.]

This was not an outlier in the number of words for the phrase, and the other subjects did not have nearly as long reading times, so only the reading time for this subject on this prepositional phrase was excluded as an outlier. Two possible outliers are suggested by the quantile-quantile plot of the French reading times. These reading times are from two subjects on the same 51-word prepositional phrase, which accounts for most of the longest reading time measures. As this phrase was an outlier in the number of words, it was not included in the model.

Number of words

The plots of phrase length show that the distributions are positively skewed [Figs. 4.5 and 4.6]. This was expected, as the minimum number of words in a phrase, by the definition used here, is two: the phrase must include at least a preposition and a word as the head of the noun phrase. The distribution of the number of words differs from that of reading time in that the mode of the number of words is the minimum. Because the number of words in a phrase is a discrete measure, the quantile-quantile plots show the data grouped in horizontal lines. For use in the linear models, the number of words is treated as normal, and therefore continuous. The plots of the French data show a gap in the distribution between 40-word and 50-word phrases. The 51-word phrase mentioned earlier was the only phrase longer than 40 words and was removed from the data set. The quantile-quantile plot suggests there are phrases that are separated from the rest of the distribution. This is also evident in the histogram, as the tail is not smooth. As there are multiple phrases in this category, they were not treated as outliers. As in the French data, there is one English phrase that stands out in word length. This phrase is the one mentioned earlier that is 58 words long.
Since this is the only phrase in the English data longer than 50 words, it was considered an outlier and not included in the model as stated above.
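Since the thesis screened outliers by eye from histograms and QQ plots rather than by a fixed rule, a reproducible alternative worth noting is an upper-fence screen. The sketch below (stdlib Python; the 3×IQR fence and the toy word counts are illustrative assumptions, not the thesis's procedure) flags the same kind of extreme phrase length discussed above:

```python
# One simple screen for the kind of outliers described above: flag values
# far beyond the upper quartile of a positively skewed sample.
# The 3*IQR multiplier is an illustrative choice; the thesis judged
# outliers from histograms and QQ plots by eye.

def quartiles(xs):
    s = sorted(xs)
    def q(p):
        i = p * (len(s) - 1)               # linear-interpolation quantile
        lo, hi = int(i), min(int(i) + 1, len(s) - 1)
        return s[lo] + (i - lo) * (s[hi] - s[lo])
    return q(0.25), q(0.75)

def upper_outliers(xs, k=3.0):
    q1, q3 = quartiles(xs)
    cut = q3 + k * (q3 - q1)               # upper fence
    return [x for x in xs if x > cut]

lengths = [2, 3, 3, 4, 4, 5, 6, 7, 8, 10, 12, 58]  # word counts, one extreme
print(upper_outliers(lengths))  # the 58-word phrase stands out
```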

[Figure 4.5: Histogram of Number of Words; (a) English, (b) French.]

[Figure 4.6: QQ Plot of Number of Words; (a) English, (b) French.]

Figure 4.7: Average Frequency per Word

             (a) English   (b) French
    Min.     4.95e-05      4.91e-04
    1st Qu.  2.62e-03      5.53e-03
    Median   5.08e-03      8.41e-03
    Mean     5.60e-03      1.03e-02
    3rd Qu.  7.78e-03      1.52e-02
    Max.     1.60e-02      2.45e-02

Average frequency per word

The histograms of average frequency per word are the least smooth of the distributions of the variables used in this model [Figs. 4.8 and 4.9]. This is most likely because these measures are dominated by the frequency of the head word of the prepositional phrase, which would explain the multi-modal look of the distributions. Furthermore, the histogram of the French frequencies has a pronounced peak; the phrases in this peak all have head word de or à, which are the most common head words [Fig. 4.14(b)]. The inclusion of the preposition in all of the measures, along with smoothing, also induces a minimum value for the frequency per word. This minimum value corresponds to the lowest-frequency preposition occurring with a number of unknown words. Despite the multi-modal nature of the distribution, it is treated as a normal distribution. The quantile-quantile plot shows that the tails are heavier than would be expected in a normal distribution, and that the distribution is positively skewed. There are no clear outliers in the histograms or the quantile-quantile plots, so no phrases were excluded based on the frequency-per-word measure.

Average number of characters per word

The distribution of the average number of characters per word is the closest to a normal distribution without transformation [Figs. 4.11 and 4.12]. The quantile-quantile plots reveal that the distribution is slightly positively skewed, with heavier tails. These plots also show the same horizontal grouping as the number-of-words plots. This is because the characters-per-word variable is the quotient of two integer-valued variables, the number of characters and the number of words.
Thus, the variable is rational-valued rather than truly continuous. The distribution is still approximated by a normal distribution for the purposes of the linear model. No outliers were apparent in the English data. One phrase was removed from the French data: it had 9.67 characters per word, while the next highest value was 8.5.
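The per-phrase predictors used here (number of words, average characters per word, average frequency per word) are straightforward to compute. A stdlib-Python sketch; the frequency table is a made-up stand-in for the corpus counts, and the unknown-word floor is an assumed smoothing choice, not the thesis's:

```python
# Sketch of the three length/frequency predictors for one prepositional
# phrase. The relative-frequency table is hypothetical; the thesis
# derived its frequencies from corpus counts with smoothing.

freq = {"of": 0.03, "the": 0.06, "project": 0.0002}  # hypothetical relative freqs

def phrase_stats(words, freq, unknown=1e-6):
    """Return (n words, avg characters per word, avg frequency per word)."""
    n = len(words)
    chars_per_word = sum(len(w) for w in words) / n
    freq_per_word = sum(freq.get(w.lower(), unknown) for w in words) / n
    return n, chars_per_word, freq_per_word

n, cpw, fpw = phrase_stats(["of", "the", "project"], freq)
```

Note how the quotient structure of `chars_per_word` (integers over integers) produces the horizontal banding seen in the QQ plots.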

[Figure 4.8: Histogram of Average Frequency per Word; (a) English, (b) French.]

[Figure 4.9: QQ Plot of Average Frequency per Word; (a) English, (b) French.]

[Figure 4.10: Average Number of Characters per Word; summary statistics for (a) English (Median 4.78) and (b) French (Median 5.00); remaining values not recoverable.]

Attachment type

The percentage of each attachment type is fairly similar between the two languages [Fig. 4.13]. The percentage of low attachment is a bit higher than the previously reported 59% (Collins and Brooks, 1995), which could be due to differences between British English and American English. Another possibility is that high attachments were more likely to be missed when tagging the data for the Dundee corpus. This difference was, however, assumed to be due to chance. Previous data were not available on the likelihood of high or low attachment in French, although some data have pointed toward low attachment being more likely (Gaussier and Cancedda, 2001).

Preposition head

In each language one preposition head is clearly the most frequent in ambiguous prepositional phrases: de in French and of in English [Fig. 4.14]. It is interesting to note that these prepositions serve roughly the same purpose in their respective languages. Preposition heads are important to include in the models because, without dividing the phrases up according to their heads, a model would make predictions heavily biased toward those for the dominant preposition heads.

Similarity ratio

A similarity ratio could not be calculated for quite a few tagged phrases. Often this resulted from a pronoun head word of either the NP1 or the NP2: there is no entry in WordNet for pronouns, and a correct similarity measure would require the referent of the pronoun, so these phrases were not included. Infrequently used proper nouns and numbers also led to similarity ratios that could not be calculated. In the models that did not include the similarity ratio, these phrases were still included.
In the model that did include the similarity ratio, it was assumed that excluding those phrases did not affect the model's predictions of significance.
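A stdlib-Python sketch of this exclusion logic. The direction of the ratio (verb-attachment similarity over noun-attachment similarity) and the `sim` function are assumptions for illustration; the thesis computed its ratio from WordNet:

```python
# Sketch of the exclusion logic described above: the similarity ratio is
# only computed when both candidate heads have usable lexical entries.
# Pronouns and bare numbers yield None, and such phrases are dropped
# from models that include the ratio. `sim` is a hypothetical stand-in
# for a WordNet-based similarity measure, and the ratio's direction
# (verb-side over noun-side) is an assumption for illustration.

PRONOUNS = {"i", "you", "he", "she", "it", "we", "they",
            "me", "him", "her", "us", "them"}

def similarity_ratio(verb, np1_head, np2_head, sim):
    for head in (np1_head, np2_head):
        if head.lower() in PRONOUNS or head.isdigit():
            return None  # referent unknown / no WordNet entry: exclude
    noun_attach = sim(np1_head, np2_head)
    if noun_attach == 0:
        return None
    return sim(verb, np2_head) / noun_attach
```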

[Figure 4.11: Histogram of Average Characters per Word; (a) English, (b) French.]

[Figure 4.12: QQ Plot of Average Characters per Word; (a) English, (b) French.]

[Figure 4.13: Attachment Type; percentage of high vs. low attachment for (a) English and (b) French; percentages not recoverable.]

[Figure 4.14: Head Word of the Prepositional Phrase; (a) English: of, in, for, on, to, with, (Other); (b) French: de, à, dans, en, sur, pour, (Other); percentages not recoverable.]

[Figure 4.15: Similarity Ratio (English); Min., 1st Qu., Median, Mean, 3rd Qu., Max., NA's; values not recoverable.]

[Figure 4.16: Histogram of Similarity Ratio.]

[Figure 4.17: QQ Plot of Similarity Ratio.]

[Figure 4.18: Correlations Between Ratio Variables; correlation matrix of Time, Words, Characters, Frequency, and (English only) Ratio for (a) English and (b) French; values not recoverable.]

The similarity ratio is positively skewed for the same reason as the other variables [Fig. 4.16]: no ratio can be less than or equal to zero, although the ratio can theoretically be arbitrarily close to zero. Two sentences were considered outliers. The largest similarity ratio was 50.5, for the verb tie and the noun rope versus the noun legs. The next largest was 28, for the verb be and the noun activity compared with the noun Hitler. As the next largest after that was 19.6, those two phrases were treated as outliers for the model that included similarity ratios.

4.2 Variable interaction

The correlations reported in Fig. 4.18 are inflated since they are not broken down by subject. However, the pattern of results is generally as expected. The positive correlation between the number of words and the reading time is high. This is both intuitive (the longer a phrase, the longer it takes to read) and a well-established fact in linguistics. The other expected correlations are between average frequency per word, characters per word, and reading times. Shorter words are usually more frequent, and more frequent words are read faster. A phrase that consists of shorter words is generally read faster than a phrase with longer words if both phrases have the same number of words.
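The inflation from not breaking correlations down by subject can be seen in a toy example: when subjects differ in overall level, a correlation computed over all observations pooled need not match the average within-subject correlation. A stdlib-Python sketch with synthetic data:

```python
# Sketch of the inflation noted above: pooling observations across
# subjects who differ in overall level can produce a large correlation
# even when the within-subject correlations are zero. Data are synthetic.

from statistics import mean

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

# (subject, words, time): s1 reads short phrases fast, s2 long phrases slowly,
# but within each subject words and time are unrelated.
data = [("s1", 2, 300), ("s1", 3, 310), ("s1", 4, 300),
        ("s2", 8, 900), ("s2", 9, 910), ("s2", 10, 900)]

pooled = pearson([w for _, w, _ in data], [t for _, _, t in data])
within = mean(pearson([w for s, w, _ in data if s == subj],
                      [t for s, _, t in data if s == subj])
              for subj in ("s1", "s2"))
# pooled is near 1 while the average within-subject correlation is 0
```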

[Figure 4.19: French baseline model results; coefficient table (Estimate, Std. Error, t value, Pr(>|t|)) for Words (p < 2e-16), Characters, and Frequency, coefficient values not recoverable. Residual standard error: 486 on 3368 degrees of freedom. Multiple R-squared: 0.81. F-statistic: 368 on 39 and 3368 DF, p-value < 2e-16.]

The box plots for reading time do not show any clear differences between high attachment and low attachment. There is a slightly lower mean for low attachment in the French data. As mentioned earlier, this effect could simply be due to the reading time for de, given the proportion of phrases with de as the head. Plots of the two-way relationships between variables are included in Appendix A.

4.3 Modeling results

In all of the results presented, the variables relevant to this paper are reported. This includes Words, Characters, Frequency, Type, Head, and, when available, Ratio. Additionally, any other variables that measure within-subject variance and are significant are reported.

French models

The baseline model [Fig. 4.19] shows that the number of words is indeed significant, as is the number of characters per word. Frequency is not significant, most likely due to the collinearity between the frequency-per-word measure and the characters-per-word measure: as the two measures are related, the variance is mostly explained without reference to the frequency per word. The full model again shows that the number of words is significant, while neither the number of characters per word nor the frequency per word remains significant. None of the head word dummy variables are shown, as none were significant. The type dummy variable did not turn out to be significant. There is a small improvement over the baseline model in the multiple R² statistic; due to the extra variables, the adjusted R² is not higher. Figure 4.22 verifies that there is no significant improvement over the baseline. Including the outliers in the model does not change which variables are significant [Fig. 4.23].
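The ANOVA comparisons in Figures 4.22 and 4.25 are the standard F test for nested linear models, as reported by R's anova(). A sketch of the statistic itself; the RSS and degrees-of-freedom values below are hypothetical, not the thesis's:

```python
# The nested-model F statistic: the drop in residual sum of squares per
# extra parameter, scaled by the full model's residual variance.
# Numbers below are hypothetical placeholders.

def nested_f(rss_reduced, df_reduced, rss_full, df_full):
    num = (rss_reduced - rss_full) / (df_reduced - df_full)
    den = rss_full / df_full
    return num / den

F = nested_f(rss_reduced=8.0e8, df_reduced=3368,
             rss_full=7.9e8, df_full=3165)
# F is referred to an F(df_reduced - df_full, df_full) distribution;
# a small F (as here) means the extra predictors buy little.
```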
There is an improvement in the R² measure. Using the transformed variables for the model gives similar results [Fig. 4.24]: Words is still significant, and there is a higher t value for Characters and a lower value for Frequency.

[Figure 4.20: French full model results; coefficient table for Words (p < 2e-16), Characters, Frequency, and Type Low, coefficient values not recoverable. Residual standard error: 487 on 3165 degrees of freedom. Multiple R-squared: 0.821. F-statistic: 59.8 on 242 and 3165 DF, p-value < 2e-16.]

[Figure 4.21: French plot of residuals against predicted values; residuals vs. fitted for lm(Time ~ Subj + Words + Chpw + Frqpw + Head + Type + Subj:Words + ...).]

[Figure 4.22: French full model ANOVA comparison with baseline; Res.Df, RSS, Df, Sum of Sq, F, Pr(>F), values not recoverable.]

[Figure 4.23: French full model results including outliers; coefficient table for Words (p < 2e-16), Characters, Frequency, and Type Low, coefficient values not recoverable. Residual standard error: 495 on 3185 degrees of freedom. Multiple R-squared: 0.841. F-statistic: 69.4 on 242 and 3185 DF, p-value < 2e-16.]

[Figure 4.24: French transformed full model results; coefficient table for Words (p < 2e-16), Characters, Frequency, Head entre, and Type Low, coefficient values not recoverable. Residual standard error (value not recoverable) on 3165 degrees of freedom. Multiple R-squared: 0.664. F-statistic: 25.8 on 242 and 3165 DF, p-value < 2e-16.]

The type dummy variable has a higher t value as well, although it is still nowhere near significant. Interestingly, one of the head word dummy variables is significant. Also, the R² is much lower than in the untransformed models. An ANOVA comparison with a transformed version of the baseline model shows that including the type and head word dummy variables results in a significant improvement [Fig. 4.25].

[Figure 4.25: French transformed model ANOVA comparison with baseline; Res.Df, RSS, Df, Sum of Sq, F, Pr(>F), values not recoverable.]

English models

For the English data, the baseline model yields results similar to those for the French data [Fig. 4.26]. Again Words and Characters are significant while Frequency is not. The intercept is included in the table as it was significant only in this model. The R² values are lower for the English data, but not drastically so.

[Figure 4.26: English baseline model results; coefficient table for (Intercept), Words (p < 2e-16), Characters, and Frequency, coefficient values not recoverable. Residual standard error: 462 on 9403 degrees of freedom. Multiple R-squared: 0.784. F-statistic: 875 on 39 and 9403 DF, p-value < 2e-16.]

[Figure 4.27: English model without ratio results; coefficient table for Words (p < 2e-16), Characters, Frequency, and Type Low, coefficient values only partly recoverable. Residual standard error: 465 on 9015 degrees of freedom. Multiple R-squared: 0.79, adjusted R-squared: 0.78. F-statistic: 79.5 on 427 and 9015 DF, p-value < 2e-16.]

[Figure 4.28: English model without ratio, ANOVA comparison with baseline; Res.Df, RSS, Df, Sum of Sq, F, Pr(>F), values not recoverable.]

For comparison with the French model, a linear model was fit to the English data without the similarity ratio. The results are shown in Figures 4.27 and 4.28. The Words variable is significant, and the Characters variable stayed significant despite the extra variables. The R² results were similar in that the multiple R² increased slightly while the adjusted R² decreased slightly. Including the similarity ratio in the model resulted in some improvement in the R² [Fig. 4.29]. However, the similarity ratio did not explain a significant amount of variance, and the improvement was not significant [Fig. 4.31]. Using transformed variables resulted in the same changes as in the French data: the number of characters per word had a higher t value and the average frequency per word a lower t value. The R² is also lower for the transformed variables. The ANOVA did not show a

[Figure 4.29: English full model results; coefficient table for Words (p < 2e-16), Chpw, Frqpw, Type Low, and Ratio, coefficient values only partly recoverable. Residual standard error (value not recoverable) on 7591 degrees of freedom. Multiple R-squared: 0.799. F-statistic: 73.0 on 414 and 7591 DF, p-value < 2.2e-16.]

[Figure 4.30: English plot of residuals against predicted values; residuals vs. fitted for lm(Time ~ Subj + Words + Chpw + Frqpw + Head + Type + VPNPratio + Subj:...).]

[Figure 4.31: English full model ANOVA comparison with baseline; Res.Df, RSS, Df, Sum of Sq, F, Pr(>F), values not recoverable.]


More information

STAT 350 Practice Final Exam Solution (Spring 2015)

STAT 350 Practice Final Exam Solution (Spring 2015) PART 1: Multiple Choice Questions: 1) A study was conducted to compare five different training programs for improving endurance. Forty subjects were randomly divided into five groups of eight subjects

More information

Association Between Variables

Association Between Variables Contents 11 Association Between Variables 767 11.1 Introduction............................ 767 11.1.1 Measure of Association................. 768 11.1.2 Chapter Summary.................... 769 11.2 Chi

More information

2. Simple Linear Regression

2. Simple Linear Regression Research methods - II 3 2. Simple Linear Regression Simple linear regression is a technique in parametric statistics that is commonly used for analyzing mean response of a variable Y which changes according

More information

MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group

MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group MISSING DATA TECHNIQUES WITH SAS IDRE Statistical Consulting Group ROAD MAP FOR TODAY To discuss: 1. Commonly used techniques for handling missing data, focusing on multiple imputation 2. Issues that could

More information

The Effect of Dropping a Ball from Different Heights on the Number of Times the Ball Bounces

The Effect of Dropping a Ball from Different Heights on the Number of Times the Ball Bounces The Effect of Dropping a Ball from Different Heights on the Number of Times the Ball Bounces Or: How I Learned to Stop Worrying and Love the Ball Comment [DP1]: Titles, headings, and figure/table captions

More information

1. The parameters to be estimated in the simple linear regression model Y=α+βx+ε ε~n(0,σ) are: a) α, β, σ b) α, β, ε c) a, b, s d) ε, 0, σ

1. The parameters to be estimated in the simple linear regression model Y=α+βx+ε ε~n(0,σ) are: a) α, β, σ b) α, β, ε c) a, b, s d) ε, 0, σ STA 3024 Practice Problems Exam 2 NOTE: These are just Practice Problems. This is NOT meant to look just like the test, and it is NOT the only thing that you should study. Make sure you know all the material

More information

THE KRUSKAL WALLLIS TEST

THE KRUSKAL WALLLIS TEST THE KRUSKAL WALLLIS TEST TEODORA H. MEHOTCHEVA Wednesday, 23 rd April 08 THE KRUSKAL-WALLIS TEST: The non-parametric alternative to ANOVA: testing for difference between several independent groups 2 NON

More information

Symbiosis of Evolutionary Techniques and Statistical Natural Language Processing

Symbiosis of Evolutionary Techniques and Statistical Natural Language Processing 1 Symbiosis of Evolutionary Techniques and Statistical Natural Language Processing Lourdes Araujo Dpto. Sistemas Informáticos y Programación, Univ. Complutense, Madrid 28040, SPAIN (email: lurdes@sip.ucm.es)

More information

Week 5: Multiple Linear Regression

Week 5: Multiple Linear Regression BUS41100 Applied Regression Analysis Week 5: Multiple Linear Regression Parameter estimation and inference, forecasting, diagnostics, dummy variables Robert B. Gramacy The University of Chicago Booth School

More information

II. DISTRIBUTIONS distribution normal distribution. standard scores

II. DISTRIBUTIONS distribution normal distribution. standard scores Appendix D Basic Measurement And Statistics The following information was developed by Steven Rothke, PhD, Department of Psychology, Rehabilitation Institute of Chicago (RIC) and expanded by Mary F. Schmidt,

More information

E(y i ) = x T i β. yield of the refined product as a percentage of crude specific gravity vapour pressure ASTM 10% point ASTM end point in degrees F

E(y i ) = x T i β. yield of the refined product as a percentage of crude specific gravity vapour pressure ASTM 10% point ASTM end point in degrees F Random and Mixed Effects Models (Ch. 10) Random effects models are very useful when the observations are sampled in a highly structured way. The basic idea is that the error associated with any linear,

More information

CONTINGENCY TABLES ARE NOT ALL THE SAME David C. Howell University of Vermont

CONTINGENCY TABLES ARE NOT ALL THE SAME David C. Howell University of Vermont CONTINGENCY TABLES ARE NOT ALL THE SAME David C. Howell University of Vermont To most people studying statistics a contingency table is a contingency table. We tend to forget, if we ever knew, that contingency

More information

2. Linear regression with multiple regressors

2. Linear regression with multiple regressors 2. Linear regression with multiple regressors Aim of this section: Introduction of the multiple regression model OLS estimation in multiple regression Measures-of-fit in multiple regression Assumptions

More information

International Statistical Institute, 56th Session, 2007: Phil Everson

International Statistical Institute, 56th Session, 2007: Phil Everson Teaching Regression using American Football Scores Everson, Phil Swarthmore College Department of Mathematics and Statistics 5 College Avenue Swarthmore, PA198, USA E-mail: peverso1@swarthmore.edu 1. Introduction

More information

MULTIPLE REGRESSION WITH CATEGORICAL DATA

MULTIPLE REGRESSION WITH CATEGORICAL DATA DEPARTMENT OF POLITICAL SCIENCE AND INTERNATIONAL RELATIONS Posc/Uapp 86 MULTIPLE REGRESSION WITH CATEGORICAL DATA I. AGENDA: A. Multiple regression with categorical variables. Coding schemes. Interpreting

More information

Lecture Notes Module 1

Lecture Notes Module 1 Lecture Notes Module 1 Study Populations A study population is a clearly defined collection of people, animals, plants, or objects. In psychological research, a study population usually consists of a specific

More information

3. Mathematical Induction

3. Mathematical Induction 3. MATHEMATICAL INDUCTION 83 3. Mathematical Induction 3.1. First Principle of Mathematical Induction. Let P (n) be a predicate with domain of discourse (over) the natural numbers N = {0, 1,,...}. If (1)

More information

Predicting Box Office Success: Do Critical Reviews Really Matter? By: Alec Kennedy Introduction: Information economics looks at the importance of

Predicting Box Office Success: Do Critical Reviews Really Matter? By: Alec Kennedy Introduction: Information economics looks at the importance of Predicting Box Office Success: Do Critical Reviews Really Matter? By: Alec Kennedy Introduction: Information economics looks at the importance of information in economic decisionmaking. Consumers that

More information

Sense-Tagging Verbs in English and Chinese. Hoa Trang Dang

Sense-Tagging Verbs in English and Chinese. Hoa Trang Dang Sense-Tagging Verbs in English and Chinese Hoa Trang Dang Department of Computer and Information Sciences University of Pennsylvania htd@linc.cis.upenn.edu October 30, 2003 Outline English sense-tagging

More information

New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction

New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction Introduction New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Predictive analytics encompasses the body of statistical knowledge supporting the analysis of massive data sets.

More information

Exploratory Data Analysis

Exploratory Data Analysis Exploratory Data Analysis Johannes Schauer johannes.schauer@tugraz.at Institute of Statistics Graz University of Technology Steyrergasse 17/IV, 8010 Graz www.statistics.tugraz.at February 12, 2008 Introduction

More information

Statistical Models in R

Statistical Models in R Statistical Models in R Some Examples Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007 Outline Statistical Models Structure of models in R Model Assessment (Part IA) Anova

More information

Lucky vs. Unlucky Teams in Sports

Lucky vs. Unlucky Teams in Sports Lucky vs. Unlucky Teams in Sports Introduction Assuming gambling odds give true probabilities, one can classify a team as having been lucky or unlucky so far. Do results of matches between lucky and unlucky

More information

Multicollinearity Richard Williams, University of Notre Dame, http://www3.nd.edu/~rwilliam/ Last revised January 13, 2015

Multicollinearity Richard Williams, University of Notre Dame, http://www3.nd.edu/~rwilliam/ Last revised January 13, 2015 Multicollinearity Richard Williams, University of Notre Dame, http://www3.nd.edu/~rwilliam/ Last revised January 13, 2015 Stata Example (See appendices for full example).. use http://www.nd.edu/~rwilliam/stats2/statafiles/multicoll.dta,

More information

Partial Estimates of Reliability: Parallel Form Reliability in the Key Stage 2 Science Tests

Partial Estimates of Reliability: Parallel Form Reliability in the Key Stage 2 Science Tests Partial Estimates of Reliability: Parallel Form Reliability in the Key Stage 2 Science Tests Final Report Sarah Maughan Ben Styles Yin Lin Catherine Kirkup September 29 Partial Estimates of Reliability:

More information

Chapter 10. Key Ideas Correlation, Correlation Coefficient (r),

Chapter 10. Key Ideas Correlation, Correlation Coefficient (r), Chapter 0 Key Ideas Correlation, Correlation Coefficient (r), Section 0-: Overview We have already explored the basics of describing single variable data sets. However, when two quantitative variables

More information

UNDERSTANDING ANALYSIS OF COVARIANCE (ANCOVA)

UNDERSTANDING ANALYSIS OF COVARIANCE (ANCOVA) UNDERSTANDING ANALYSIS OF COVARIANCE () In general, research is conducted for the purpose of explaining the effects of the independent variable on the dependent variable, and the purpose of research design

More information

UNDERSTANDING THE TWO-WAY ANOVA

UNDERSTANDING THE TWO-WAY ANOVA UNDERSTANDING THE e have seen how the one-way ANOVA can be used to compare two or more sample means in studies involving a single independent variable. This can be extended to two independent variables

More information

Phase 2 of the D4 Project. Helmut Schmid and Sabine Schulte im Walde

Phase 2 of the D4 Project. Helmut Schmid and Sabine Schulte im Walde Statistical Verb-Clustering Model soft clustering: Verbs may belong to several clusters trained on verb-argument tuples clusters together verbs with similar subcategorization and selectional restriction

More information

Author Gender Identification of English Novels

Author Gender Identification of English Novels Author Gender Identification of English Novels Joseph Baena and Catherine Chen December 13, 2013 1 Introduction Machine learning algorithms have long been used in studies of authorship, particularly in

More information

Predict the Popularity of YouTube Videos Using Early View Data

Predict the Popularity of YouTube Videos Using Early View Data 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

SPSS Guide: Regression Analysis

SPSS Guide: Regression Analysis SPSS Guide: Regression Analysis I put this together to give you a step-by-step guide for replicating what we did in the computer lab. It should help you run the tests we covered. The best way to get familiar

More information

Descriptive statistics Statistical inference statistical inference, statistical induction and inferential statistics

Descriptive statistics Statistical inference statistical inference, statistical induction and inferential statistics Descriptive statistics is the discipline of quantitatively describing the main features of a collection of data. Descriptive statistics are distinguished from inferential statistics (or inductive statistics),

More information

Improving SAS Global Forum Papers

Improving SAS Global Forum Papers Paper 3343-2015 Improving SAS Global Forum Papers Vijay Singh, Pankush Kalgotra, Goutam Chakraborty, Oklahoma State University, OK, US ABSTRACT Just as research is built on existing research, the references

More information

Inference for two Population Means

Inference for two Population Means Inference for two Population Means Bret Hanlon and Bret Larget Department of Statistics University of Wisconsin Madison October 27 November 1, 2011 Two Population Means 1 / 65 Case Study Case Study Example

More information

ANOVA. February 12, 2015

ANOVA. February 12, 2015 ANOVA February 12, 2015 1 ANOVA models Last time, we discussed the use of categorical variables in multivariate regression. Often, these are encoded as indicator columns in the design matrix. In [1]: %%R

More information

ECON 142 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE #2

ECON 142 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE #2 University of California, Berkeley Prof. Ken Chay Department of Economics Fall Semester, 005 ECON 14 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE # Question 1: a. Below are the scatter plots of hourly wages

More information

Module 3: Correlation and Covariance

Module 3: Correlation and Covariance Using Statistical Data to Make Decisions Module 3: Correlation and Covariance Tom Ilvento Dr. Mugdim Pašiƒ University of Delaware Sarajevo Graduate School of Business O ften our interest in data analysis

More information

n + n log(2π) + n log(rss/n)

n + n log(2π) + n log(rss/n) There is a discrepancy in R output from the functions step, AIC, and BIC over how to compute the AIC. The discrepancy is not very important, because it involves a difference of a constant factor that cancels

More information

Keywords academic writing phraseology dissertations online support international students

Keywords academic writing phraseology dissertations online support international students Phrasebank: a University-wide Online Writing Resource John Morley, Director of Academic Support Programmes, School of Languages, Linguistics and Cultures, The University of Manchester Summary A salient

More information

A Study to Predict No Show Probability for a Scheduled Appointment at Free Health Clinic

A Study to Predict No Show Probability for a Scheduled Appointment at Free Health Clinic A Study to Predict No Show Probability for a Scheduled Appointment at Free Health Clinic Report prepared for Brandon Slama Department of Health Management and Informatics University of Missouri, Columbia

More information

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written

More information

Thai Language Self Assessment

Thai Language Self Assessment The following are can do statements in four skills: Listening, Speaking, Reading and Writing. Put a in front of each description that applies to your current Thai proficiency (.i.e. what you can do with

More information

Chapter 7: Simple linear regression Learning Objectives

Chapter 7: Simple linear regression Learning Objectives Chapter 7: Simple linear regression Learning Objectives Reading: Section 7.1 of OpenIntro Statistics Video: Correlation vs. causation, YouTube (2:19) Video: Intro to Linear Regression, YouTube (5:18) -

More information

Outline. Topic 4 - Analysis of Variance Approach to Regression. Partitioning Sums of Squares. Total Sum of Squares. Partitioning sums of squares

Outline. Topic 4 - Analysis of Variance Approach to Regression. Partitioning Sums of Squares. Total Sum of Squares. Partitioning sums of squares Topic 4 - Analysis of Variance Approach to Regression Outline Partitioning sums of squares Degrees of freedom Expected mean squares General linear test - Fall 2013 R 2 and the coefficient of correlation

More information

Cohesive writing 1. Conjunction: linking words What is cohesive writing?

Cohesive writing 1. Conjunction: linking words What is cohesive writing? Cohesive writing 1. Conjunction: linking words What is cohesive writing? Cohesive writing is writing which holds together well. It is easy to follow because it uses language effectively to guide the reader.

More information