Ambiguous Prepositional Phrase Resolution by Humans. Joseph Houpt


Ambiguous Prepositional Phrase Resolution by Humans

Joseph Houpt

Master of Science
Artificial Intelligence
School of Informatics
University of Edinburgh
2006

Abstract

This paper examines the information humans use to resolve ambiguous prepositional phrase attachments. The work is based on a large corpus of eye-tracking data for both English and French. Multiple regression and linear mixed effects models are used to examine the significance of various factors. Variables that have been shown to affect reading time in experimental settings, such as attachment type and the head words of the prepositional phrases, are not found to be significant in most cases. A significant interaction between attachment type and language is found. When the data were transformed, some head words were also significant.

Acknowledgements

I am particularly thankful to my fiancée for her support, both mental and editorial. I would also like to thank my family for the support they have given me and for forgiving my lack of communication during this project, and my mother for her editorial help. Finally, I would like to thank my supervisor, Frank Keller, for the direction and input he has given me on this project.

Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Joseph Houpt)

Table of Contents

1 Introduction 1
2 Literature Review 5
3 Methods
   3.1 Description of the data used
   3.2 Data processing
   3.3 Data analysis
4 Results
   4.1 Variables used
       Total reading time
       Number of words
       Average frequency per word
       Average number of characters per word
       Attachment type
       Preposition head
       Similarity ratio
       Variable interaction
   4.2 Modeling results
       French models
       English models
       English - French comparative model
5 Discussion 37
A Two-way Relationship Plots 41
B Models Including Two-Way Interaction 48
   B.1 English
   B.2 French
Bibliography 55

Chapter 1

Introduction

The sentences we hear and read are processed effortlessly. The information is automatically extracted from the sentence as it is input. The process runs from taking in the physical representation of a sentence to using the information held within it. The collection of processes that make up this transformation is known as the human sentence processing mechanism (henceforth HSPM).

There are many different levels at which the processing takes place. First, there is the translation of the physical form, whether sound or written text, into a form that can be interpreted by the brain. This information is then grouped into words. The words can have a meaning, refer to something, and play different roles in the sentence. This level is referred to as the lexical and semantic level of processing. The sentence also has a structure that determines the interactions between the words, referred to as the structural or syntactic level. Finally, each sentence is normally part of a larger context, such as a conversation or a text, which is known as the discourse level.

The way these different levels interact in the process of interpreting a sentence has been the subject of much debate. Some claim that each level is interpreted entirely separately from the others in different stages, with no interaction. This view considers each level to be processed by separate, informationally encapsulated processes that work in series. Other theories claim that the different levels are used together to process the sentence.

The most cited example of a theory based on information encapsulation is the garden path model. This model treats sentence processing in two stages. The first stage creates a structural representation of the sentence using only the grammatical categories of the words, along with general syntactic principles.
The general rules of syntax are applied first; these include rules such as: a noun phrase consists of a determiner and a noun. There is some overlap in what these rules cover, so at some stages of processing there can be multiple possible structures. In this case, the HSPM is thought to choose based on one of two principles: late

closure and minimal attachment.

Late closure: When possible, attach incoming lexical items into the clause or phrase currently being processed.

Minimal attachment: Attach incoming material into the phrase-marker being constructed using the fewest nodes consistent with the well-formedness rules of the language. (Frazier and Rayner, 1982, pg. 180)

If the HSPM reaches a point in the sentence where the structure it has chosen so far turns out not to be tenable, it returns to the point of ambiguity and tries another option. This revision can be made on structural grounds, such as when there are words in the sentence that no rule accounts for. The sentence is also checked for semantic coherence at this point. If a structure does not make sense, such as seeing with a fork, it can be revised at this stage.

Another theory holds that the HSPM uses some semantic information while it is building the structure from the beginning. As the sentence is parsed, there is no strong separation between the use of syntactic and semantic information. Instead, at a point of ambiguity the HSPM chooses a structure informed by the semantics of the words involved. Various types of lexical information can be used. In one theory, whether or not a phrase plays the part of an argument, referred to as the argumenthood of the phrase, is an important factor in determining the structure of a sentence. An argument is defined as follows:

If a phrase P is an argument of a head H, P fills a role in the relation described by H, the presence of which may be implied by H. P's contribution to the meaning of the sentence is a function of that role and hence depends on the particular identity of H. (Schütze and Gibson, 1999, pg. 410)

In some sentences the structure still cannot be determined from the semantics of those words.
One example is The cop saw the robber with the binoculars. This sentence could mean either that the cop was using the binoculars or that the robber had binoculars. In these cases, the discourse level of information is needed. Some theories claim that this information is used in making structural decisions from the beginning rather than at later stages, for example [Crain and Steedman (1985)]. More recently, some theorists have argued that the frequencies of different structures are a factor in determining the structure of a sentence: when there are multiple choices for a structure, the more likely one is chosen. Which frequencies are used, and how finely they are calculated, is not always agreed upon. In one version of the theory, a combination of semantic and statistical information is used. For example, if the verb is an action verb then it is more likely to have a prepositional phrase attached [MacDonald et al. (1994)]. The words themselves can also factor into the decision. For example, if the preposition is of, then the prepositional phrase is more likely to attach to the noun phrase.

This paper uses these points of ambiguity to investigate the information that is important to the HSPM. The focus is on prepositional phrase attachment ambiguity, as it is the most common type and thus has the most data available. In the corpus used, 43% of the English sentences and 34% of the French sentences contained an ambiguous prepositional phrase. Prepositional phrase ambiguity arises from the possibility of two different rules that could apply to the sequence V NP PP. One possibility is VP -> V NP PP, attaching the prepositional phrase to the verb phrase; the other is VP -> V NP with NP -> NP PP, attaching it to the noun phrase. The former is referred to as high attachment and the latter as low attachment. In the sentence Jane ate the salad with a fork, the prepositional phrase is normally interpreted to mean that the fork was used to eat the salad. This is the high attachment structure [Fig. 1]. The alternative, low attachment structure would be interpreted as the salad being in possession of a fork when Jane ate it [Fig. 1]. A better example of a low attachment sentence would be Jane ate the salad from Tesco [Fig. 1].
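The two competing structures can be made concrete as labelled bracketings. The sketch below is illustrative only; the bracket helper is a hypothetical convenience, not anything used in the thesis.

```python
# Minimal sketch of the two bracketed structures for the ambiguous PP in
# "Jane ate the salad with a fork". The bracket() helper is hypothetical.

def bracket(label, *children):
    """Render a parse node as a labelled bracket string."""
    return "[" + label + " " + " ".join(children) + "]"

pp = bracket("PP", "with a fork")
np_obj = bracket("NP", "the salad")

# High attachment: VP -> V NP PP (the fork is the instrument of eating)
high = bracket("VP", bracket("V", "ate"), np_obj, pp)

# Low attachment: VP -> V NP, NP -> NP PP (the salad comes with a fork)
low = bracket("VP", bracket("V", "ate"), bracket("NP", np_obj, pp))

print(high)  # [VP [V ate] [NP the salad] [PP with a fork]]
print(low)   # [VP [V ate] [NP [NP the salad] [PP with a fork]]]
```

The only difference between the two strings is whether the PP node sits inside the object NP or directly under the VP, which is exactly the ambiguity at issue.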

There are some ambiguous prepositional phrases for which the choice of structure does not make a difference to the meaning. An example of this type from Hindle and Rooth (1993) is: The organization has opened a cleaning center in Seward. Here, if in Seward is attached to the verb has opened, the cleaning center is understood to be in Seward. Similarly, if in Seward is attached to the noun phrase cleaning center, the cleaning center is still understood to be in Seward. Hindle and Rooth refer to these situations as semantically indeterminate [Hindle and Rooth (1993)].

Chapter 2

Literature Review

To investigate the plausibility of the HSPM using the minimal attachment principle, Frazier and Rayner used eye-tracking. The sentences they used for the experiment include a disambiguation zone, where the HSPM is forced to choose a specific structure. They focus on the reading time for this zone, using a reading time per character measure. For the sentences they tested, they found a significantly increased reading time for low attachment over high attachment [Frazier and Rayner (1982)].

One criticism was that the sentences used in Frazier and Rayner's experiments were presented in isolation. This does not account for context, which Crain and Steedman (1985) argue is used in the initial disambiguation decision. In a later study, Altmann tested the extent to which biasing contextual information affected reading times. He also used a disambiguation zone to test reading times. When no context was presented, Altmann verified that reading times for minimal attachment sentences were faster. However, when the context was set up to bias the reader toward a non-minimal attachment, he found that the minimal attachment reading time was slower [Altmann (1985)].

Within a corpus of natural text, this theory predicts that one attachment type is not any faster than the other as long as there is a biasing context. If reading times were found to be lower in general for high attachment decisions, this could be because context did not induce bias in a large percentage of cases. Thus, although the context biasing theory is intuitive and has been demonstrated in an experimental setting, it is difficult to verify in a natural setting. The sentences used for the experiments were devised to create a specific bias. This works well in a limited experimental situation.
However, finding evidence within a corpus of natural text is complicated by factors such as determining the biasing context for a significant number of sentences, decisions that could, in turn, be subject to disagreement. In this research, I assume it unlikely that naturally occurring text in context introduces a bias toward one attachment decision while another is intended.

Another problem for the garden path model was that the same effects did not occur cross-linguistically. Frazier did present evidence that the minimal attachment principle is used in Dutch [Frazier (1987)]. Cuetos and Mitchell (1988) present evidence, based on both sentence completion and on-line testing, that late closure is not a linguistic universal. Zagar et al. (1997) present evidence that early closure is preferred in French as well. To adapt the garden path theory to the cross-linguistic evidence, Frazier presented construal theory [Frazier and Clifton Jr. (1997)]. As construal theory does not make any distinction in the strength of attachment preference between languages, it predicts no interaction of language and attachment type in reading time.

An alternative approach that Mitchell and Cuetos consider to reconcile the evidence of an early closure preference in Spanish is based on the statistics of the language [Mitchell and Cuetos (1991), as cited in Zagar et al. (1997)]. This theory is often referred to as the linguistic tuning hypothesis. It predicts that the HSPM prefers the structure that is most common in the language. Thus, if early closure is more common in French, then reading time will be faster for early closure sentences in French. Likewise, since late closure is more common in English, reading time is faster for late closure sentences in English. MacDonald et al. (1994) develop the idea more thoroughly and suggest possible frequencies that are important to prepositional phrase disambiguation. They suggest that head word co-occurrence as well as prepositional head preference are important factors.

Schütze (1995) reviewed many of the previous studies. He argues that these effects are more succinctly described with argument/modifier distinctions. He claims that argumenthood had not been properly controlled for, but would explain many of the results reported. A study followed in which Schütze and Gibson control for argumenthood and show that it has a significant impact on reading time [Schütze and Gibson (1999)].
Testing this theory on a large natural language corpus presents difficulties similar to those of Altmann's theory. Determining the argumenthood of a prepositional phrase must be done manually, and thus would consume a large amount of time. Furthermore, determining whether the presence of the prepositional phrase is implied by a head word in a natural setting is not always straightforward and is therefore subject to disagreement. One general consequence of this theory would be an interaction between the head word of the noun phrase, the head of the verb phrase, and the attachment type: the rule above states that the presence of the prepositional phrase is implied depending on the head word of the phrase to which it may attach. In testing their theory, Schütze and Gibson assume that if there is a head word that clearly implies the presence of a prepositional phrase, then the reading time for that prepositional phrase will be fast in comparison with a phrase whose head word does not clearly imply it [Schütze and Gibson (1999)]. Although this implies a clear way to test the theory on natural language, it would require a lot of data, especially if the changes in reading time were particularly small. The lexical-frequency based position holds that the argument/adjunct distinction can be

reduced to relative frequencies [MacDonald et al. (1994)]. Thus, if a prepositional phrase more commonly occurs with the head of the verb phrase than with the head of the noun phrase, high attachment is preferred, and vice versa. Schütze and Gibson (1999) compare the findings in their paper with frequency-based accounts using P(PPhead | VPhead) compared to P(PPhead | NPhead). There is a high correlation between the co-occurrence of the words and their lexical similarity.

Prepositional phrase attachment is also particularly problematic for machine parsers. Thus, there has been a large amount of research into which features are useful in determining the correct parse for a sentence. I assume that features that are particularly helpful to machine disambiguation are more likely to be used by the HSPM, by virtue of the information they contain. Collins and Brooks (1995) report that attaching the prepositional phrase to the noun in every case achieves 59% accuracy. This suggests that a general syntactic rule could be the basis for a structural choice by the HSPM. However, the percentage of attachment decisions that disambiguate to low attachment is not overwhelming, so if the HSPM did use this default strategy, it would not be very efficient. Moreover, a low attachment default would run contrary to the evidence in support of a high attachment default. Using the most likely attachment for each prepositional phrase head word increases the accuracy to 79% [Collins and Brooks (1995)]. MacDonald et al. (1994) do suggest that the preposition is used by the HSPM. If the head word is used by the HSPM in determining an initial attachment decision, then an attachment type by head word interaction is expected. Given just the four head words, Ratnaparkhi et al. (1994) report that humans have an average accuracy of 88.2%, compared with 93.2% given the whole sentence.
This suggests that a significant amount of information for the disambiguation can be found in the head words alone, but that there is also information available in the rest of the sentence. The accuracy would presumably increase further if the sentences were given in context. This does not imply that these different pieces of information are used by the HSPM in the initial attachment decision, just that they are used at some point. Using a model based on the head words alone, Collins and Brooks (1995) report 84.1% accuracy for machine disambiguation. Another model, using transformation-based learning with just the head words, achieved 80.8% accuracy; thirteen of the top 20 transformations were based on the preposition alone [Brill and Resnik (1994)].

Many of the computational approaches to PP disambiguation gain accuracy when some type of semantic information is included. A variety of methods have been used to include this information. Ratnaparkhi et al. (1994) use mutual information clustering to classify the head words; including classes in their model increases the accuracy from 77.7% to 81.6%. Brill and Resnik (1994) classify the head words based on WordNet to reach an accuracy of

81.8%. Although this is not much higher than the 80.8% accuracy without class information, far fewer transformations were required. This suggests that the information contained in the words themselves can be abstracted to information about their semantic class. Budanitsky and Hirst argue that WordNet-based and other semantically based approaches are superior for measuring lexical similarity, in part because co-occurrence is not necessarily a metric [Budanitsky and Hirst (2006)]. If this is the case, it would be worthwhile to compare the predictive influence of a co-occurrence based metric and a semantic-based measure on reading time, to determine whether the bias is indeed purely frequency based or is based more on similarity. Such a measure would be limited to the similarity of the head words of the verb phrase and the two noun phrases, as there is no semantic corpus relating those head words to prepositions.
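The preposition-only baseline discussed in this chapter, in the spirit of Collins and Brooks (1995), can be sketched in a few lines: for each preposition, predict whichever attachment it took most often in training data. The tuples below are toy examples invented for illustration, not data from any of the cited studies.

```python
from collections import Counter, defaultdict

# Toy training tuples: (verb, noun1, preposition, noun2, attachment).
# These are invented examples, not corpus data.
train = [
    ("ate", "salad", "with", "fork", "high"),
    ("saw", "robber", "with", "binoculars", "high"),
    ("bought", "shirt", "with", "stripes", "low"),
    ("opened", "center", "in", "Seward", "high"),
    ("ate", "salad", "from", "Tesco", "low"),
]

# Count attachment outcomes per preposition.
by_prep = defaultdict(Counter)
for verb, noun1, prep, noun2, label in train:
    by_prep[prep][label] += 1

def predict(prep, default="low"):
    """Most frequent attachment seen for this preposition; default otherwise."""
    counts = by_prep.get(prep)
    return counts.most_common(1)[0][0] if counts else default

print(predict("with"))  # "high" (2 of the 3 training cases)
print(predict("of"))    # unseen preposition, falls back to the default
```

The fallback to a fixed default mirrors the 59% always-attach-low baseline; conditioning on the preposition is what lifts such a model toward the 79% figure reported above.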

Chapter 3

Methods

3.1 Description of the data used

The data used in this research are from the Dundee Eye-tracking Corpus. This corpus contains eye-tracking data for 20 subjects: 10 native French speakers and 10 native English speakers. Each subject read 800 screens of 5 lines of text per screen, for a total of 4,000 lines in their respective native language. The total word counts are 51,502 in English and 47,445 in French.

The French text for the Dundee corpus is from editorials in the French newspaper Le Monde. This text is a subset of the text used for the French Treebank, allowing for cross-referencing of the data. The version of the French Treebank available included full syntactic parses for 1,081 of the 1,990 sentences in the Dundee data. The English text is from editorials published in the British newspaper The Independent. There was no syntactic information available for this text. The text was tagged using the TnT software [Brants (2000)] trained on the Wall Street Journal section of the Penn Treebank [Marcus et al. (1994)].

The sentences were then filtered based on whether they could contain a syntactically ambiguous prepositional phrase attachment. The criteria were meant to be broad enough that all cases would appear in the filtered data set, at the expense of letting too many sentences through. The criteria were as follows:

    noun := /(NN|CD|LS|PRP$?|WP$?|DT)/
    verb := /VB(D|G|N|P|Z)/
    preposition := /(IN|TO)/
    pattern := /.* verb .* noun .* preposition .* noun .*/

The sentence must contain some type of verb, followed by some type of noun, number, or pronoun. Then, either a word tagged as IN that could be used as a preposition, or TO, must follow. Finally, there must be a noun, number, or pronoun after the potential preposition. The words were required to be in this order, but not necessarily adjacent. Once filtered, the sentences were manually checked for an ambiguous prepositional phrase attachment structure.
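The filter above can be sketched as a subsequence check over a sentence's POS tag sequence. This is a minimal reconstruction under the assumption of Penn Treebank-style tags; the helper name and the exact regular expressions are mine, pieced together from the criteria as listed.

```python
import re

# Hedged reconstruction of the Section 3.1 filter: a verb tag, later a
# noun-like tag, later a preposition tag, later another noun-like tag,
# in order but not necessarily adjacent.
NOUN = re.compile(r"NN\S*|CD|LS|PRP\$?|WP\$?|DT")
VERB = re.compile(r"VB[DGNPZ]")
PREP = re.compile(r"IN|TO")

def is_candidate(tags):
    """tags: list of POS tags for one sentence. Greedy subsequence match."""
    needed = [VERB, NOUN, PREP, NOUN]
    i = 0
    for tag in tags:
        if i < len(needed) and needed[i].fullmatch(tag):
            i += 1
    return i == len(needed)

# "Jane ate the salad with a fork" tagged (toy example):
print(is_candidate(["NNP", "VBD", "DT", "NN", "IN", "DT", "NN"]))  # True
print(is_candidate(["NNP", "VBD", "RB"]))                          # False
```

Because a greedy left-to-right scan always finds a subsequence if one exists, this check is equivalent to the `/.* verb .* noun .* preposition .* noun .*/` pattern over the tag string.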

Figure 3.1: High ambiguous PP attachment
Figure 3.2: Low ambiguous PP attachment

Syntactic information pertaining to the identification of the prepositional phrase (verbal nucleus; noun phrase 1; prepositional phrase; noun phrase 2) was added to sentences of this type. The definition of a verbal nucleus is based on that used for the French Treebank and is adapted for use in English: the verbal nucleus is defined as clitics, auxiliaries, negation, and verb [Laboratoire de Linguistique Formelle (2006)]. NP1 refers to the noun phrase that is the immediate child of the verb phrase and is either the parent or the sibling of the prepositional phrase in question. The noun phrase object of the prepositional phrase is referred to as NP2.

An ambiguous prepositional phrase structure was of one of two types. If the prepositional phrase was attached to the noun phrase (NP1), then there would be a verbal nucleus with NP1 as its immediate sibling to the right, and the last child of NP1 would be the prepositional phrase (see Figure 3.2). In the case of high attachment, the verbal nucleus, noun phrase 1, and the prepositional phrase would be immediate siblings, in that order (see Figure 3.1). In both cases the object of the prepositional phrase must be a noun phrase (NP2).

In cases where the attachment type was not clear, similar sentences with a consistent attachment from the Penn Treebank were used to determine the structure. A best guess was used if there were no similar situations or if there were conflicting attachment types. This only occurred for around 5% of sentences. One hundred sentences were randomly selected and attachment choices were made for these by a second person to calculate the level of agreement. 82.2% were disambiguated in the same manner, resulting in a Cohen's κ of 0.68. A previous study showed 91.3% agreement when disambiguating prepositional phrase attachments in English [Ratnaparkhi et al. (1994)].

Frequency information for English was taken from the written section of the British National Corpus as harvested by Adam Kilgarriff. French frequency was based on the text of the CD-ROM du Monde Diplomatique as harvested by Jean Véronis. Both frequencies were smoothed using Good-Turing smoothing for words that occurred fewer than 10 times.

3.2 Data processing

To align the eye-tracking data from the Dundee Corpus with the syntactic information from the French Treebank, all non-letter characters were removed. The remaining letters were converted to lower case. Accents were also removed from all characters, due to inconsistency in accents between the Treebank and the Dundee Corpus. The parsed sentences were then collected along with a line cross-referencing them with the Dundee Corpus. If there was no available match for a sentence, but the previous and next sentences had exactly one line between them in the other corpus, then those sentences were assumed to correspond. In each of these cases the correspondence was verified manually. Although the same process was not necessary for the English data, the tagged sentences were treated as a separate corpus and processed the same way as the French data, to maintain consistency in data format.

Each sentence with syntactic information was then checked for prepositional phrase attachments that were syntactically ambiguous, using the criteria for finding ambiguities in English described in Section 3.1. The references to the Dundee Corpus were then returned for those cases of syntactic ambiguity, along with whether the attachment was high or low.
The head words of each of the verbal nucleus, noun phrase 1, prepositional phrase, and noun phrase 2 were also extracted at this point. For the English sentences, the head-finding rules were those used by David Magerman, with the exception that possessive noun phrases were not marked up and thus not treated separately [Collins and Magerman (1995)]. The French head-finding rules were those developed by Abhishek Arun [Arun (2004)]. In the final step of preprocessing, Pedersen et al.'s implementation [Pedersen et al. (2004)] of the Lesk similarity measure for English words was used in conjunction with WordNet version 2.1 [Patwardhan et al. (2003)]. The similarity between the head word of NP2 and each of the NP1 head and the verbal nucleus head was calculated.
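The Lesk measure scores two senses by the overlap of their dictionary glosses. A bare-bones sketch of the idea, and of the similarity ratio built from it, is below. The glosses are invented stand-ins; the actual work used Pedersen et al.'s WordNet-based implementation, which extends glosses with related synsets and is far more robust than this toy.

```python
# Toy gloss-overlap sketch of the Lesk-style similarity ratio. The glosses
# below are hypothetical definitions for illustration only.

def lesk_overlap(gloss_a, gloss_b):
    """Number of word types shared by two definition glosses."""
    return len(set(gloss_a.split()) & set(gloss_b.split()))

glosses = {
    "eat":   "take in solid food through the mouth",
    "salad": "food mixture of raw vegetables",
    "fork":  "utensil used to lift food to the mouth",
}

vp_np2 = lesk_overlap(glosses["eat"], glosses["fork"])     # VP head vs NP2 head
np1_np2 = lesk_overlap(glosses["salad"], glosses["fork"])  # NP1 head vs NP2 head

# Ratio > 1 suggests NP2 is semantically closer to the verb than to NP1.
ratio = vp_np2 / np1_np2
print(vp_np2, np1_np2, ratio)  # 3 1 3.0
```

For "eat" and "fork" the glosses share "food", "the", and "mouth", while "salad" and "fork" share only "food", so the ratio leans toward verb attachment, which matches the instrumental reading of with a fork.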

3.3 Data analysis

The variables included in the model are as follows:

Words: the total number of words in the ambiguous PP.
Frequency: the average frequency per word in the ambiguous PP.
Characters: the average number of characters per word in the ambiguous PP.
Type_low: a dummy variable identifying the attachment type of the PP.
Head_HEAD: dummy variables identifying the head word of the prepositional phrase (e.g. Head_of).
Ratio: the ratio of the VP head and NP2 head similarity to the NP1 head and NP2 head similarity.
Subject: dummy variables identifying the subject.

For this experiment, the data were analyzed using multiple regression. The general format for the model is based on method three presented in Lorch and Myers (1990). There were five variables used for the model of these data. The reading time per character, which is the total reading time for the ambiguous prepositional phrase divided by the number of characters in the prepositional phrase, was treated as the dependent variable. The total number of words was also included in the model, due to the effect it has on reading time beyond the number of characters. The per-word frequency, that is, the sum of the probabilities of the words in the prepositional phrase, based on the smoothed frequency data above, divided by the number of words in the prepositional phrase, was also used.

Dummy encoding was used for the attachment type and the subject variables. Thus the type variable was 1 if the prepositional phrase attached to the noun phrase and 0 if it attached to the verb phrase. Nine binary variables were used for subject, so that each of the first nine subjects had a unique variable set to 1 with the rest 0; for the last subject all nine variables were 0. A sixth variable, the VP-NP ratio, was included in the model of the English data. This variable was created by dividing the Lesk similarity between the verbal nucleus head and the NP2 head by the Lesk similarity between the NP1 head and the NP2 head.
This measure was not included in the French model, due to the lack of access to a French version of WordNet as well as the lack of evidence for the Lesk measure's accuracy in French. Pearson's r was used to check for collinearity among the real-valued variables. The distribution of each of those variables was also checked, using a histogram, to verify that it could reasonably be approximated by a normal distribution. In the case of the dummy variables, the conditional distribution of time per character was checked for normality at each level. Once the data had been fit to the linear model minimizing the squared residuals, the validity of the model was checked. To test for highly influential points, Cook's D was calculated and plotted, along with a plot of leverage against the standardized residuals. The

residuals were plotted against the predicted time per character, and the spread of the residuals against each of the predictors was checked to verify the linearity of the model and the equality of variance assumption. Interactions with the Head variables were not included, as they would introduce too many variables into the model to control for the variance due to Subject.

The baseline model used is:

    Time = B_s*Subject + β_0 + β_1*Words + β_2*Characters + β_3*Frequency
           + B_sw*Subject*Words + B_sc*Subject*Characters
           + B_sf*Subject*Frequency + ε_ij

The equation used to model both the French and the English data is:

    Time = B_s*Subject + β_0 + β_1*Words + β_2*Characters + β_3*Frequency
           + B_h*Head + B_t*Type
           + B_sw*Subject*Words + B_sc*Subject*Characters
           + B_sf*Subject*Frequency + B_sh*Subject*Head
           + B_st*Subject*Type + ε_ij

One English model also includes terms for the similarity ratio: β_4*Ratio and B_sr*Subject*Ratio.

Separate models were fit with transformed data. The following transformations were made: log10(Time) was used instead of Time; √Words was used instead of Words; and log10(Frequency) was used instead of Frequency. Also, each of the ratio variables was scaled to have zero mean and unit variance.

No interactions between the within-subject predictors were included in these models. This is based on two assumptions. One is that if a predictor is not significant itself, then interactions involving that predictor are less likely to be significant. The other is that including too many predictors can cause important predictors to show up as insignificant [Howell (1992)]. For completeness, models that include all two-way interactions for within-subject predictors are included in Appendix B.

To compare the interaction between attachment type and language, a linear mixed model was used. The model was based on a model in Fox (2002). The equation is as follows, with β

representing fixed effects and b representing random effects:

    Time_ij = β_1 + β_2*Type_ij + β_3*Language_i + β_4*Language_i*Type_ij
              + b_i1 + b_i2*Type_ij + ε_ij

For this experiment, results were declared as significant at α = . All of the modeling and assumption checking was done using R. Any data for which parse data were missing from the French section were assumed to be missing at random with respect to the variables tested here. The other missing data were the Ratio information, as described later; these were only left out of models that included the Ratio variable and of the baseline models used for evaluating those models. Outliers were determined based on the histogram and quantile-quantile plots of the variables. As no rigorous definition was used, separate models were also fit to the data including the outliers, to check for any changes in significance.
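The dummy-coded regression described in Section 3.3 can be sketched with ordinary least squares. This is an illustrative reconstruction only: the thesis fit its models in R, and the data below are random stand-ins, with one subject left out as the baseline exactly as in the coding scheme above. The per-subject interaction terms are omitted here to keep the sketch short.

```python
import numpy as np

# Sketch of the dummy-coded multiple regression (simulated data, not the
# Dundee measurements; subject interaction terms omitted for brevity).
rng = np.random.default_rng(0)
n, n_subjects = 120, 10

words = rng.integers(2, 12, n).astype(float)   # words in the ambiguous PP
chars = rng.normal(5.0, 1.0, n)                # avg characters per word
freq = rng.normal(-4.0, 1.0, n)                # e.g. log10 per-word frequency
subject = rng.integers(0, n_subjects, n)

# Dummy-code subjects: 9 indicator columns, the last subject is baseline.
dummies = np.zeros((n, n_subjects - 1))
for j in range(n_subjects - 1):
    dummies[:, j] = (subject == j)

# Simulated time per character with a known word-count effect of 0.5.
time = 3.0 + 0.5 * words + 0.1 * chars - 0.2 * freq + rng.normal(0, 0.1, n)

X = np.column_stack([np.ones(n), words, chars, freq, dummies])
beta, *_ = np.linalg.lstsq(X, time, rcond=None)
print(beta[1])  # estimated Words coefficient, close to the true 0.5
```

The subject dummies absorb each reader's baseline speed, which is the point of Lorch and Myers' method three: the remaining coefficients estimate within-subject effects of the item-level predictors.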

Chapter 4

Results

4.1 Variables used

Total reading time

The total reading time measure is treated as the dependent variable. The distribution of the reading times in this experiment is similar to those found in other studies. In particular, it is evident from both the histograms [Fig. 4.2] and the quantile-quantile plots [Fig. 4.3] that the distribution is positively skewed. This is to be expected because reading times cannot be less than 0. For this experiment, prepositional phrases that were not read are not included, so there are no reading times of 0 recorded. This was done because there were far more 0 reading times than would be expected if these times were treated as part of the general distribution of reading times.

On the quantile-quantile plot for the English reading times, there are three points that look like outliers. Two of these reading times are for a particularly long prepositional phrase, 58 words long, for two different subjects. As this phrase was an outlier for the number of words, it was not included. The third is for another subject on another long prepositional phrase, 30

Figure 4.1: Total Reading Time. (a) English: Min. 52, 1st Qu. 361, Median 619, Mean 942. (b) French: Min. 52, 1st Qu. 440, Median 740.
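The positive skew described here is what motivates the log10-transformed models of Section 3.3. A minimal simulation (invented values, not Dundee data) shows how a log transform pulls in the long right tail typical of reading times:

```python
import numpy as np

# Simulated positively skewed "reading times" in ms (not Dundee data).
rng = np.random.default_rng(1)
times = rng.lognormal(mean=6.5, sigma=0.7, size=2000)

def skewness(x):
    """Sample skewness: mean of the cubed standardized values."""
    z = (x - x.mean()) / x.std()
    return (z ** 3).mean()

raw_skew = skewness(times)
log_skew = skewness(np.log10(times))
print(raw_skew, log_skew)  # raw skew is large and positive; log skew is near 0
```

Because a lognormal variable is exactly normal after a log transform, the transformed distribution better satisfies the normality checks applied to the model variables.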

[Figure 4.2: Histogram of Reading Time per Character; (a) English, (b) French.]

[Figure 4.3: QQ Plot of Reading Time per Character; (a) English, (b) French.]

[Figure 4.4: Number of Words; summary statistics for (a) English (Median 4.00) and (b) French (Median 4.00); remaining values not recoverable.]

This was not an outlier in the number of words for the phrase, and the other subjects did not have nearly as long reading times, so only the reading time for this subject on this prepositional phrase was excluded as an outlier. Two possible outliers are suggested by the quantile-quantile plot of the French reading times. These reading times are from two subjects on the same 51-word prepositional phrase, which accounts for most of the longest reading time measures. As this phrase was an outlier in the number of words, it was not included in the model.

Number of words

The plots of phrase length show that the distributions are positively skewed [Figs. 4.5 and 4.6]. This was expected, as the minimum number of words in a phrase, by the definition used here, is two: the phrase must include at least a preposition and a word as the head of the noun phrase. The distribution of the number of words differs from that of reading time in that the mode of the number of words is the minimum. Because the number of words in a phrase is a discrete measure, the quantile-quantile plots show the data grouped in horizontal lines. For use in the linear models, the number of words is treated as normal, and therefore continuous. The plots of the French data show a gap in the distribution between 40-word and 50-word phrases. The 51-word phrase mentioned earlier was the only phrase longer than 40 words and was removed from the data set. The quantile-quantile plot suggests there are phrases that are separated from the rest of the distribution. This is also evident in the histogram, as the tail is not smooth. As there are multiple phrases in this category, they were not treated as outliers. As in the French data, there is one English phrase that stands out in word length. This phrase is the one mentioned earlier that is 58 words long.
Since this is the only phrase in the English data longer than 50 words, it was considered an outlier and not included in the model as stated above.
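Since the thesis screened outliers by eye from histograms and QQ plots rather than by a fixed rule, a reproducible alternative worth noting is an upper-fence screen. The sketch below (stdlib Python; the 3×IQR fence and the toy word counts are illustrative assumptions, not the thesis's procedure) flags the same kind of extreme phrase length discussed above:

```python
# One simple screen for the kind of outliers described above: flag values
# far beyond the upper quartile of a positively skewed sample.
# The 3*IQR multiplier is an illustrative choice; the thesis judged
# outliers from histograms and QQ plots by eye.

def quartiles(xs):
    s = sorted(xs)
    def q(p):
        i = p * (len(s) - 1)               # linear-interpolation quantile
        lo, hi = int(i), min(int(i) + 1, len(s) - 1)
        return s[lo] + (i - lo) * (s[hi] - s[lo])
    return q(0.25), q(0.75)

def upper_outliers(xs, k=3.0):
    q1, q3 = quartiles(xs)
    cut = q3 + k * (q3 - q1)               # upper fence
    return [x for x in xs if x > cut]

lengths = [2, 3, 3, 4, 4, 5, 6, 7, 8, 10, 12, 58]  # word counts, one extreme
print(upper_outliers(lengths))  # the 58-word phrase stands out
```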

[Figure 4.5: Histogram of Number of Words; (a) English, (b) French.]

[Figure 4.6: QQ Plot of Number of Words; (a) English, (b) French.]

Figure 4.7: Average Frequency per Word

             (a) English   (b) French
    Min.     4.95e-05      4.91e-04
    1st Qu.  2.62e-03      5.53e-03
    Median   5.08e-03      8.41e-03
    Mean     5.60e-03      1.03e-02
    3rd Qu.  7.78e-03      1.52e-02
    Max.     1.60e-02      2.45e-02

Average frequency per word

The histograms of average frequency per word are the least smooth of the distributions of the variables used in this model [Figs. 4.8 and 4.9]. This is most likely because these measures are dominated by the frequency of the head word of the prepositional phrase, which would explain the multi-modal look of the distributions. Furthermore, the histogram of the French frequencies has a pronounced peak; the phrases in this peak all have head word de or à, which are the most common head words [Fig. 4.14(b)]. The inclusion of the preposition in all of the measures, along with smoothing, also induces a minimum value for the frequency per word. This minimum value corresponds to the lowest-frequency preposition occurring with a number of unknown words. Despite the multi-modal nature of the distribution, it is treated as a normal distribution. The quantile-quantile plot shows that the tails are heavier than would be expected in a normal distribution, and that the distribution is positively skewed. There are no clear outliers in the histograms or the quantile-quantile plots, so no phrases were excluded based on the frequency-per-word measure.

Average number of characters per word

The distribution of the average number of characters per word is the closest to a normal distribution without transformation [Figs. 4.11 and 4.12]. The quantile-quantile plots reveal that the distribution is slightly positively skewed, with heavier tails. These plots also show the same horizontal grouping as the number-of-words plots. This is because the characters-per-word variable is the quotient of two integer-valued variables, the number of characters and the number of words.
Thus, the variable is rational-valued rather than truly continuous. The distribution is still approximated by a normal distribution for the purposes of the linear model. No outliers were apparent in the English data. One phrase was removed from the French data: it had 9.67 characters per word, while the next highest value was 8.5.
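The per-phrase predictors used here (number of words, average characters per word, average frequency per word) are straightforward to compute. A stdlib-Python sketch; the frequency table is a made-up stand-in for the corpus counts, and the unknown-word floor is an assumed smoothing choice, not the thesis's:

```python
# Sketch of the three length/frequency predictors for one prepositional
# phrase. The relative-frequency table is hypothetical; the thesis
# derived its frequencies from corpus counts with smoothing.

freq = {"of": 0.03, "the": 0.06, "project": 0.0002}  # hypothetical relative freqs

def phrase_stats(words, freq, unknown=1e-6):
    """Return (n words, avg characters per word, avg frequency per word)."""
    n = len(words)
    chars_per_word = sum(len(w) for w in words) / n
    freq_per_word = sum(freq.get(w.lower(), unknown) for w in words) / n
    return n, chars_per_word, freq_per_word

n, cpw, fpw = phrase_stats(["of", "the", "project"], freq)
```

Note how the quotient structure of `chars_per_word` (integers over integers) produces the horizontal banding seen in the QQ plots.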

[Figure 4.8: Histogram of Average Frequency per Word; (a) English, (b) French.]

[Figure 4.9: QQ Plot of Average Frequency per Word; (a) English, (b) French.]

[Figure 4.10: Average Number of Characters per Word; summary statistics for (a) English (Median 4.78) and (b) French (Median 5.00); remaining values not recoverable.]

Attachment type

The percentage of each attachment type is fairly similar between the two languages [Fig. 4.13]. The percentage of low attachment is a bit higher than the previously reported 59% (Collins and Brooks, 1995), which could be due to differences between British English and American English. Another possibility is that high attachments were more likely to be missed when tagging the data for the Dundee corpus. This difference was, however, assumed to be due to chance. Previous data were not available on the likelihood of high or low attachment in French, although some data have pointed toward low attachment being more likely (Gaussier and Cancedda, 2001).

Preposition head

In each language one preposition head is clearly the most frequent in ambiguous prepositional phrases: de in French and of in English [Fig. 4.14]. It is interesting to note that these prepositions serve roughly the same purpose in their respective languages. Preposition heads are important to include in the models because, without dividing the phrases up according to their heads, a model would make predictions heavily biased toward those for the dominant preposition heads.

Similarity ratio

A similarity ratio could not be calculated for quite a few tagged phrases. Often this resulted from a pronoun head word of either the NP1 or the NP2: there is no entry in WordNet for pronouns, and a correct similarity measure would require the referent of the pronoun, so these phrases were not included. Infrequently used proper nouns and numbers also led to similarity ratios that could not be calculated. In the models that did not include the similarity ratio, these phrases were still included.
In the model that did include the similarity ratio, it was assumed that excluding those phrases did not affect the model's predictions of significance.
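A stdlib-Python sketch of this exclusion logic. The direction of the ratio (verb-attachment similarity over noun-attachment similarity) and the `sim` function are assumptions for illustration; the thesis computed its ratio from WordNet:

```python
# Sketch of the exclusion logic described above: the similarity ratio is
# only computed when both candidate heads have usable lexical entries.
# Pronouns and bare numbers yield None, and such phrases are dropped
# from models that include the ratio. `sim` is a hypothetical stand-in
# for a WordNet-based similarity measure, and the ratio's direction
# (verb-side over noun-side) is an assumption for illustration.

PRONOUNS = {"i", "you", "he", "she", "it", "we", "they",
            "me", "him", "her", "us", "them"}

def similarity_ratio(verb, np1_head, np2_head, sim):
    for head in (np1_head, np2_head):
        if head.lower() in PRONOUNS or head.isdigit():
            return None  # referent unknown / no WordNet entry: exclude
    noun_attach = sim(np1_head, np2_head)
    if noun_attach == 0:
        return None
    return sim(verb, np2_head) / noun_attach
```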

[Figure 4.11: Histogram of Average Characters per Word; (a) English, (b) French.]

[Figure 4.12: QQ Plot of Average Characters per Word; (a) English, (b) French.]

[Figure 4.13: Attachment Type; percentage of high vs. low attachment for (a) English and (b) French; percentages not recoverable.]

[Figure 4.14: Head Word of the Prepositional Phrase; (a) English: of, in, for, on, to, with, (Other); (b) French: de, à, dans, en, sur, pour, (Other); percentages not recoverable.]

[Figure 4.15: Similarity Ratio (English); Min., 1st Qu., Median, Mean, 3rd Qu., Max., NA's; values not recoverable.]

[Figure 4.16: Histogram of Similarity Ratio.]

[Figure 4.17: QQ Plot of Similarity Ratio.]

[Figure 4.18: Correlations Between Ratio Variables; correlation matrix of Time, Words, Characters, Frequency, and (English only) Ratio for (a) English and (b) French; values not recoverable.]

The similarity ratio is positively skewed for the same reason as the other variables [Fig. 4.16]: no ratio can be less than or equal to zero, although the ratio can theoretically be arbitrarily close to zero. Two sentences were considered outliers. The largest similarity ratio was 50.5, for the verb tie and the noun rope versus the noun legs. The next largest was 28, for the verb be and the noun activity compared with the noun Hitler. As the next largest after that was 19.6, those two phrases were treated as outliers for the model that included similarity ratios.

4.2 Variable interaction

The correlations reported in Fig. 4.18 are inflated since they are not broken down by subject. However, the pattern of results is generally as expected. The positive correlation between the number of words and the reading time is high. This is both intuitive (the longer a phrase, the longer it takes to read) and a well-established fact in linguistics. The other expected correlations are between average frequency per word, characters per word, and reading times. Shorter words are usually more frequent, and more frequent words are read faster. A phrase that consists of shorter words is generally read faster than a phrase with longer words if both phrases have the same number of words.
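The inflation from not breaking correlations down by subject can be seen in a toy example: when subjects differ in overall level, a correlation computed over all observations pooled need not match the average within-subject correlation. A stdlib-Python sketch with synthetic data:

```python
# Sketch of the inflation noted above: pooling observations across
# subjects who differ in overall level can produce a large correlation
# even when the within-subject correlations are zero. Data are synthetic.

from statistics import mean

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

# (subject, words, time): s1 reads short phrases fast, s2 long phrases slowly,
# but within each subject words and time are unrelated.
data = [("s1", 2, 300), ("s1", 3, 310), ("s1", 4, 300),
        ("s2", 8, 900), ("s2", 9, 910), ("s2", 10, 900)]

pooled = pearson([w for _, w, _ in data], [t for _, _, t in data])
within = mean(pearson([w for s, w, _ in data if s == subj],
                      [t for s, _, t in data if s == subj])
              for subj in ("s1", "s2"))
# pooled is near 1 while the average within-subject correlation is 0
```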

[Figure 4.19: French baseline model results; coefficient table (Estimate, Std. Error, t value, Pr(>|t|)) for Words (p < 2e-16), Characters, and Frequency, coefficient values not recoverable. Residual standard error: 486 on 3368 degrees of freedom. Multiple R-squared: 0.81. F-statistic: 368 on 39 and 3368 DF, p-value < 2e-16.]

The box plots for reading time do not show any clear differences between high attachment and low attachment. There is a slightly lower mean for low attachment in the French data. As mentioned earlier, this effect could simply be due to the reading time for de, given the proportion of phrases with de as the head. Plots of the two-way relationships between variables are included in Appendix A.

4.3 Modeling results

In all of the results presented, the variables relevant to this paper are reported. This includes Words, Characters, Frequency, Type, Head, and, when available, Ratio. Additionally, any other variables that measure within-subject variance and are significant are reported.

French models

The baseline model [Fig. 4.19] shows that the number of words is indeed significant, as is the number of characters per word. Frequency is not significant, most likely due to the collinearity between the frequency-per-word measure and the characters-per-word measure: as the two measures are related, the variance is mostly explained without reference to the frequency per word. The full model again shows that the number of words is significant, while neither the number of characters per word nor the frequency per word remains significant. None of the head word dummy variables are shown, as none were significant. The type dummy variable did not turn out to be significant. There is a small improvement over the baseline model in the multiple R² statistic; due to the extra variables, the adjusted R² is not higher. Figure 4.22 verifies that there is no significant improvement over the baseline. Including the outliers in the model does not change which variables are significant [Fig. 4.23].
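The ANOVA comparisons in Figures 4.22 and 4.25 are the standard F test for nested linear models, as reported by R's anova(). A sketch of the statistic itself; the RSS and degrees-of-freedom values below are hypothetical, not the thesis's:

```python
# The nested-model F statistic: the drop in residual sum of squares per
# extra parameter, scaled by the full model's residual variance.
# Numbers below are hypothetical placeholders.

def nested_f(rss_reduced, df_reduced, rss_full, df_full):
    num = (rss_reduced - rss_full) / (df_reduced - df_full)
    den = rss_full / df_full
    return num / den

F = nested_f(rss_reduced=8.0e8, df_reduced=3368,
             rss_full=7.9e8, df_full=3165)
# F is referred to an F(df_reduced - df_full, df_full) distribution;
# a small F (as here) means the extra predictors buy little.
```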
There is an improvement in the R² measure. Using the transformed variables for the model gives similar results [Fig. 4.24]: Words is still significant, and there is a higher t value for Characters and a lower value for Frequency.

[Figure 4.20: French full model results; coefficient table for Words (p < 2e-16), Characters, Frequency, and Type Low, coefficient values not recoverable. Residual standard error: 487 on 3165 degrees of freedom. Multiple R-squared: 0.821. F-statistic: 59.8 on 242 and 3165 DF, p-value < 2e-16.]

[Figure 4.21: French plot of residuals against predicted values; residuals vs. fitted for lm(Time ~ Subj + Words + Chpw + Frqpw + Head + Type + Subj:Words + ...).]

[Figure 4.22: French full model ANOVA comparison with baseline; Res.Df, RSS, Df, Sum of Sq, F, Pr(>F), values not recoverable.]

[Figure 4.23: French full model results including outliers; coefficient table for Words (p < 2e-16), Characters, Frequency, and Type Low, coefficient values not recoverable. Residual standard error: 495 on 3185 degrees of freedom. Multiple R-squared: 0.841. F-statistic: 69.4 on 242 and 3185 DF, p-value < 2e-16.]

[Figure 4.24: French transformed full model results; coefficient table for Words (p < 2e-16), Characters, Frequency, Head entre, and Type Low, coefficient values not recoverable. Residual standard error (value not recoverable) on 3165 degrees of freedom. Multiple R-squared: 0.664. F-statistic: 25.8 on 242 and 3165 DF, p-value < 2e-16.]

The type dummy variable has a higher t value as well, although it is still nowhere near significant. Interestingly, one of the head word dummy variables is significant. Also, the R² is much lower than in the untransformed models. An ANOVA comparison with a transformed version of the baseline model shows that including the type and head word dummy variables results in a significant improvement [Fig. 4.25].

[Figure 4.25: French transformed model ANOVA comparison with baseline; Res.Df, RSS, Df, Sum of Sq, F, Pr(>F), values not recoverable.]

English models

For the English data, the baseline model yields results similar to those for the French data [Fig. 4.26]. Again Words and Characters are significant while Frequency is not. The intercept is included in the table as it was significant only in this model. The R² values are lower for the English data, but not drastically so.

[Figure 4.26: English baseline model results; coefficient table for (Intercept), Words (p < 2e-16), Characters, and Frequency, coefficient values not recoverable. Residual standard error: 462 on 9403 degrees of freedom. Multiple R-squared: 0.784. F-statistic: 875 on 39 and 9403 DF, p-value < 2e-16.]

[Figure 4.27: English model without ratio results; coefficient table for Words (p < 2e-16), Characters, Frequency, and Type Low, coefficient values only partly recoverable. Residual standard error: 465 on 9015 degrees of freedom. Multiple R-squared: 0.79, adjusted R-squared: 0.78. F-statistic: 79.5 on 427 and 9015 DF, p-value < 2e-16.]

[Figure 4.28: English model without ratio, ANOVA comparison with baseline; Res.Df, RSS, Df, Sum of Sq, F, Pr(>F), values not recoverable.]

For comparison with the French model, a linear model was fit to the English data without the similarity ratio. The results are shown in Figures 4.27 and 4.28. The Words variable is significant, and the Characters variable stayed significant despite the extra variables. The R² results were similar in that the multiple R² increased slightly while the adjusted R² decreased slightly. Including the similarity ratio in the model resulted in some improvement in the R² [Fig. 4.29]. However, the similarity ratio did not explain a significant amount of variance, and the improvement was not significant [Fig. 4.31]. Using transformed variables resulted in the same changes as in the French data: the number of characters per word had a higher t value and the average frequency per word a lower t value. The R² is also lower for the transformed variables. The ANOVA did not show a

[Figure 4.29: English full model results; coefficient table for Words (p < 2e-16), Chpw, Frqpw, Type Low, and Ratio, coefficient values only partly recoverable. Residual standard error (value not recoverable) on 7591 degrees of freedom. Multiple R-squared: 0.799. F-statistic: 73.0 on 414 and 7591 DF, p-value < 2.2e-16.]

[Figure 4.30: English plot of residuals against predicted values; residuals vs. fitted for lm(Time ~ Subj + Words + Chpw + Frqpw + Head + Type + VPNPratio + Subj:...).]

[Figure 4.31: English full model ANOVA comparison with baseline; Res.Df, RSS, Df, Sum of Sq, F, Pr(>F), values not recoverable.]


More information

STAT 350 Practice Final Exam Solution (Spring 2015)

STAT 350 Practice Final Exam Solution (Spring 2015) PART 1: Multiple Choice Questions: 1) A study was conducted to compare five different training programs for improving endurance. Forty subjects were randomly divided into five groups of eight subjects

More information

Association Between Variables

Association Between Variables Contents 11 Association Between Variables 767 11.1 Introduction............................ 767 11.1.1 Measure of Association................. 768 11.1.2 Chapter Summary.................... 769 11.2 Chi

More information

2. Simple Linear Regression

2. Simple Linear Regression Research methods - II 3 2. Simple Linear Regression Simple linear regression is a technique in parametric statistics that is commonly used for analyzing mean response of a variable Y which changes according

More information

MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group

MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group MISSING DATA TECHNIQUES WITH SAS IDRE Statistical Consulting Group ROAD MAP FOR TODAY To discuss: 1. Commonly used techniques for handling missing data, focusing on multiple imputation 2. Issues that could

More information

The Effect of Dropping a Ball from Different Heights on the Number of Times the Ball Bounces

The Effect of Dropping a Ball from Different Heights on the Number of Times the Ball Bounces The Effect of Dropping a Ball from Different Heights on the Number of Times the Ball Bounces Or: How I Learned to Stop Worrying and Love the Ball Comment [DP1]: Titles, headings, and figure/table captions

More information

1. The parameters to be estimated in the simple linear regression model Y=α+βx+ε ε~n(0,σ) are: a) α, β, σ b) α, β, ε c) a, b, s d) ε, 0, σ

1. The parameters to be estimated in the simple linear regression model Y=α+βx+ε ε~n(0,σ) are: a) α, β, σ b) α, β, ε c) a, b, s d) ε, 0, σ STA 3024 Practice Problems Exam 2 NOTE: These are just Practice Problems. This is NOT meant to look just like the test, and it is NOT the only thing that you should study. Make sure you know all the material

More information

THE KRUSKAL WALLLIS TEST

THE KRUSKAL WALLLIS TEST THE KRUSKAL WALLLIS TEST TEODORA H. MEHOTCHEVA Wednesday, 23 rd April 08 THE KRUSKAL-WALLIS TEST: The non-parametric alternative to ANOVA: testing for difference between several independent groups 2 NON

More information

Symbiosis of Evolutionary Techniques and Statistical Natural Language Processing

Symbiosis of Evolutionary Techniques and Statistical Natural Language Processing 1 Symbiosis of Evolutionary Techniques and Statistical Natural Language Processing Lourdes Araujo Dpto. Sistemas Informáticos y Programación, Univ. Complutense, Madrid 28040, SPAIN (email: lurdes@sip.ucm.es)

More information

Week 5: Multiple Linear Regression

Week 5: Multiple Linear Regression BUS41100 Applied Regression Analysis Week 5: Multiple Linear Regression Parameter estimation and inference, forecasting, diagnostics, dummy variables Robert B. Gramacy The University of Chicago Booth School

More information

II. DISTRIBUTIONS distribution normal distribution. standard scores

II. DISTRIBUTIONS distribution normal distribution. standard scores Appendix D Basic Measurement And Statistics The following information was developed by Steven Rothke, PhD, Department of Psychology, Rehabilitation Institute of Chicago (RIC) and expanded by Mary F. Schmidt,

More information

E(y i ) = x T i β. yield of the refined product as a percentage of crude specific gravity vapour pressure ASTM 10% point ASTM end point in degrees F

E(y i ) = x T i β. yield of the refined product as a percentage of crude specific gravity vapour pressure ASTM 10% point ASTM end point in degrees F Random and Mixed Effects Models (Ch. 10) Random effects models are very useful when the observations are sampled in a highly structured way. The basic idea is that the error associated with any linear,

More information

CONTINGENCY TABLES ARE NOT ALL THE SAME David C. Howell University of Vermont

CONTINGENCY TABLES ARE NOT ALL THE SAME David C. Howell University of Vermont CONTINGENCY TABLES ARE NOT ALL THE SAME David C. Howell University of Vermont To most people studying statistics a contingency table is a contingency table. We tend to forget, if we ever knew, that contingency

More information

2. Linear regression with multiple regressors

2. Linear regression with multiple regressors 2. Linear regression with multiple regressors Aim of this section: Introduction of the multiple regression model OLS estimation in multiple regression Measures-of-fit in multiple regression Assumptions

More information

International Statistical Institute, 56th Session, 2007: Phil Everson

International Statistical Institute, 56th Session, 2007: Phil Everson Teaching Regression using American Football Scores Everson, Phil Swarthmore College Department of Mathematics and Statistics 5 College Avenue Swarthmore, PA198, USA E-mail: peverso1@swarthmore.edu 1. Introduction

More information

MULTIPLE REGRESSION WITH CATEGORICAL DATA

MULTIPLE REGRESSION WITH CATEGORICAL DATA DEPARTMENT OF POLITICAL SCIENCE AND INTERNATIONAL RELATIONS Posc/Uapp 86 MULTIPLE REGRESSION WITH CATEGORICAL DATA I. AGENDA: A. Multiple regression with categorical variables. Coding schemes. Interpreting

More information

Lecture Notes Module 1

Lecture Notes Module 1 Lecture Notes Module 1 Study Populations A study population is a clearly defined collection of people, animals, plants, or objects. In psychological research, a study population usually consists of a specific

More information

3. Mathematical Induction

3. Mathematical Induction 3. MATHEMATICAL INDUCTION 83 3. Mathematical Induction 3.1. First Principle of Mathematical Induction. Let P (n) be a predicate with domain of discourse (over) the natural numbers N = {0, 1,,...}. If (1)

More information

Predicting Box Office Success: Do Critical Reviews Really Matter? By: Alec Kennedy Introduction: Information economics looks at the importance of

Predicting Box Office Success: Do Critical Reviews Really Matter? By: Alec Kennedy Introduction: Information economics looks at the importance of Predicting Box Office Success: Do Critical Reviews Really Matter? By: Alec Kennedy Introduction: Information economics looks at the importance of information in economic decisionmaking. Consumers that

More information

Sense-Tagging Verbs in English and Chinese. Hoa Trang Dang

Sense-Tagging Verbs in English and Chinese. Hoa Trang Dang Sense-Tagging Verbs in English and Chinese Hoa Trang Dang Department of Computer and Information Sciences University of Pennsylvania htd@linc.cis.upenn.edu October 30, 2003 Outline English sense-tagging

More information

New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction

New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction Introduction New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Predictive analytics encompasses the body of statistical knowledge supporting the analysis of massive data sets.

More information

Exploratory Data Analysis

Exploratory Data Analysis Exploratory Data Analysis Johannes Schauer johannes.schauer@tugraz.at Institute of Statistics Graz University of Technology Steyrergasse 17/IV, 8010 Graz www.statistics.tugraz.at February 12, 2008 Introduction

More information

Statistical Models in R

Statistical Models in R Statistical Models in R Some Examples Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007 Outline Statistical Models Structure of models in R Model Assessment (Part IA) Anova

More information

Lucky vs. Unlucky Teams in Sports

Lucky vs. Unlucky Teams in Sports Lucky vs. Unlucky Teams in Sports Introduction Assuming gambling odds give true probabilities, one can classify a team as having been lucky or unlucky so far. Do results of matches between lucky and unlucky

More information

Multicollinearity Richard Williams, University of Notre Dame, http://www3.nd.edu/~rwilliam/ Last revised January 13, 2015

Multicollinearity Richard Williams, University of Notre Dame, http://www3.nd.edu/~rwilliam/ Last revised January 13, 2015 Multicollinearity Richard Williams, University of Notre Dame, http://www3.nd.edu/~rwilliam/ Last revised January 13, 2015 Stata Example (See appendices for full example).. use http://www.nd.edu/~rwilliam/stats2/statafiles/multicoll.dta,

More information

Partial Estimates of Reliability: Parallel Form Reliability in the Key Stage 2 Science Tests

Partial Estimates of Reliability: Parallel Form Reliability in the Key Stage 2 Science Tests Partial Estimates of Reliability: Parallel Form Reliability in the Key Stage 2 Science Tests Final Report Sarah Maughan Ben Styles Yin Lin Catherine Kirkup September 29 Partial Estimates of Reliability:

More information

Chapter 10. Key Ideas Correlation, Correlation Coefficient (r),

Chapter 10. Key Ideas Correlation, Correlation Coefficient (r), Chapter 0 Key Ideas Correlation, Correlation Coefficient (r), Section 0-: Overview We have already explored the basics of describing single variable data sets. However, when two quantitative variables

More information

UNDERSTANDING ANALYSIS OF COVARIANCE (ANCOVA)

UNDERSTANDING ANALYSIS OF COVARIANCE (ANCOVA) UNDERSTANDING ANALYSIS OF COVARIANCE () In general, research is conducted for the purpose of explaining the effects of the independent variable on the dependent variable, and the purpose of research design

More information

UNDERSTANDING THE TWO-WAY ANOVA

UNDERSTANDING THE TWO-WAY ANOVA UNDERSTANDING THE e have seen how the one-way ANOVA can be used to compare two or more sample means in studies involving a single independent variable. This can be extended to two independent variables

More information

Phase 2 of the D4 Project. Helmut Schmid and Sabine Schulte im Walde

Phase 2 of the D4 Project. Helmut Schmid and Sabine Schulte im Walde Statistical Verb-Clustering Model soft clustering: Verbs may belong to several clusters trained on verb-argument tuples clusters together verbs with similar subcategorization and selectional restriction

More information

Author Gender Identification of English Novels

Author Gender Identification of English Novels Author Gender Identification of English Novels Joseph Baena and Catherine Chen December 13, 2013 1 Introduction Machine learning algorithms have long been used in studies of authorship, particularly in

More information

Predict the Popularity of YouTube Videos Using Early View Data

Predict the Popularity of YouTube Videos Using Early View Data 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

SPSS Guide: Regression Analysis

SPSS Guide: Regression Analysis SPSS Guide: Regression Analysis I put this together to give you a step-by-step guide for replicating what we did in the computer lab. It should help you run the tests we covered. The best way to get familiar

More information

Descriptive statistics Statistical inference statistical inference, statistical induction and inferential statistics

Descriptive statistics Statistical inference statistical inference, statistical induction and inferential statistics Descriptive statistics is the discipline of quantitatively describing the main features of a collection of data. Descriptive statistics are distinguished from inferential statistics (or inductive statistics),

More information

Improving SAS Global Forum Papers

Improving SAS Global Forum Papers Paper 3343-2015 Improving SAS Global Forum Papers Vijay Singh, Pankush Kalgotra, Goutam Chakraborty, Oklahoma State University, OK, US ABSTRACT Just as research is built on existing research, the references

More information

Inference for two Population Means

Inference for two Population Means Inference for two Population Means Bret Hanlon and Bret Larget Department of Statistics University of Wisconsin Madison October 27 November 1, 2011 Two Population Means 1 / 65 Case Study Case Study Example

More information

ANOVA. February 12, 2015

ANOVA. February 12, 2015 ANOVA February 12, 2015 1 ANOVA models Last time, we discussed the use of categorical variables in multivariate regression. Often, these are encoded as indicator columns in the design matrix. In [1]: %%R

More information

ECON 142 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE #2

ECON 142 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE #2 University of California, Berkeley Prof. Ken Chay Department of Economics Fall Semester, 005 ECON 14 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE # Question 1: a. Below are the scatter plots of hourly wages

More information

Module 3: Correlation and Covariance

Module 3: Correlation and Covariance Using Statistical Data to Make Decisions Module 3: Correlation and Covariance Tom Ilvento Dr. Mugdim Pašiƒ University of Delaware Sarajevo Graduate School of Business O ften our interest in data analysis

More information

n + n log(2π) + n log(rss/n)

n + n log(2π) + n log(rss/n) There is a discrepancy in R output from the functions step, AIC, and BIC over how to compute the AIC. The discrepancy is not very important, because it involves a difference of a constant factor that cancels

More information

Keywords academic writing phraseology dissertations online support international students

Keywords academic writing phraseology dissertations online support international students Phrasebank: a University-wide Online Writing Resource John Morley, Director of Academic Support Programmes, School of Languages, Linguistics and Cultures, The University of Manchester Summary A salient

More information

A Study to Predict No Show Probability for a Scheduled Appointment at Free Health Clinic

A Study to Predict No Show Probability for a Scheduled Appointment at Free Health Clinic A Study to Predict No Show Probability for a Scheduled Appointment at Free Health Clinic Report prepared for Brandon Slama Department of Health Management and Informatics University of Missouri, Columbia

More information

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written

More information

Thai Language Self Assessment

Thai Language Self Assessment The following are can do statements in four skills: Listening, Speaking, Reading and Writing. Put a in front of each description that applies to your current Thai proficiency (.i.e. what you can do with

More information

Chapter 7: Simple linear regression Learning Objectives

Chapter 7: Simple linear regression Learning Objectives Chapter 7: Simple linear regression Learning Objectives Reading: Section 7.1 of OpenIntro Statistics Video: Correlation vs. causation, YouTube (2:19) Video: Intro to Linear Regression, YouTube (5:18) -

More information

Outline. Topic 4 - Analysis of Variance Approach to Regression. Partitioning Sums of Squares. Total Sum of Squares. Partitioning sums of squares

Outline. Topic 4 - Analysis of Variance Approach to Regression. Partitioning Sums of Squares. Total Sum of Squares. Partitioning sums of squares Topic 4 - Analysis of Variance Approach to Regression Outline Partitioning sums of squares Degrees of freedom Expected mean squares General linear test - Fall 2013 R 2 and the coefficient of correlation

More information

Cohesive writing 1. Conjunction: linking words What is cohesive writing?

Cohesive writing 1. Conjunction: linking words What is cohesive writing? Cohesive writing 1. Conjunction: linking words What is cohesive writing? Cohesive writing is writing which holds together well. It is easy to follow because it uses language effectively to guide the reader.

More information