Selected Topics in Applied Machine Learning: An integrating view on data analysis and learning algorithms ESSLLI 2015 Barcelona, Spain http://ufal.mff.cuni.cz/esslli2015 Barbora Hladká hladka@ufal.mff.cuni.cz Martin Holub holub@ufal.mff.cuni.cz Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics ESSLLI 2015 Hladká & Holub Day 2, page 1/38
Block 2.1 Data analysis (cont'd) Motivation No. 1 We, as students of English, want to understand the following sentences properly: He broke down and cried when we talked to him about it. Major cried, jabbing a finger in the direction of one heckler. If we are not sure, we check the definitions of the verb cry in a dictionary. ESSLLI 2015 Hladká & Holub Day 2, page 2/38
Verb Patterns Recognition CRY -- dictionary definitions ESSLLI 2015 Hladká & Holub Day 2, page 3/38
Verb Patterns Recognition Based on the explanation and the examples of usage, we can recognize the two meanings of cry in the sentences He broke down and cried when we talked to him about it. [1] Major cried, jabbing a finger in the direction of one heckler. [2] ESSLLI 2015 Hladká & Holub Day 2, page 4/38
Verb Patterns Recognition Motivation No. 2 We, as developers of natural language applications, need to recognize verb meanings automatically. The Verb Patterns Recognition (VPR) task is the computational linguistic task of lexical disambiguation of verbs: a lexicon consists of verb usage patterns that correspond to dictionary definitions, and disambiguation means recognizing the verb usage pattern in a given sentence. ESSLLI 2015 Hladká & Holub Day 2, page 5/38
VPR Verb patterns CRY -- Pattern definitions
Pattern 1: [Human] cry [no object] -- Explanation: [[Human]] weeps, usually because [[Human]] is unhappy or in pain -- Example: His advice to stressful women was: "If you cry, don't cry alone."
Pattern 4: [Human] cry [THAT-CL WH-CL QUOTE] ({out}) -- Explanation: [[Human]] shouts ([QUOTE]) loudly, typically in order to attract attention -- Example: You can hear them screaming and banging their heads, crying that they want to go home.
Pattern 7: [Entity State] cry [{out}] [{for} Action] [no object] -- Explanation: [[Entity State]] requires [[Action]] to be taken urgently -- Example: Identifying areas which cry out for improvement, or even simply areas of muddle and misunderstanding, is by no means negative -- rather a spur to action.
E.g., pattern 1 of cry consists of a subject that is supposed to be a Human, and of no object.
ESSLLI 2015 Hladká & Holub Day 2, page 6/38
VPR Getting examples Examples for the VPR task are the output of annotation. 1 Choosing the verbs you are interested in (here: cry, submit) 2 Defining their patterns 3 Collecting sentences with the chosen verbs ESSLLI 2015 Hladká & Holub Day 2, page 7/38
VPR Getting examples 4 Annotating the sentences: assign the pattern that best fits the given sentence; if you think that no pattern matches the sentence, choose "u"; if you think that the given word is not a verb, choose "x". ESSLLI 2015 Hladká & Holub Day 2, page 8/38
VPR Data Basic statistics
            CRY                           SUBMIT
instances   250                           250
classes     1     4    7    u    x        u    1     2    4    5
frequency   131   59   13   33   14       7    177   33   12   21
ESSLLI 2015 Hladká & Holub Day 2, page 9/38
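The class counts above can be reproduced directly in R. A minimal sketch, assuming the development file cry.development.csv used later in this block contains all 250 annotated instances and that tp is the target-pattern column (both names appear later in these slides):
# Load the cry development data (tab-separated), as done later in this block
examples <- read.csv("cry.development.csv", sep = "\t")
# Absolute and relative frequencies of the target patterns
table(examples$tp)
round(prop.table(table(examples$tp)), 3)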
VPR Data representation Each instance is represented by an instance id, a feature vector, and a target pattern. The feature vector is divided into four feature families: a morphological family (MF), two morpho-syntactic families (STA and MST), and a semantic family (SEM). [Table: example rows of the data, e.g. instance 129825 with its binary feature values and target pattern 1, and another instance with target pattern 7.] For more details, see vpr.handout posted at the course webpage. ESSLLI 2015 Hladká & Holub Day 2, page 10/38
VPR Feature extraction He broke down and cried when we talked to him about it. (OK/KO marks whether the extracted feature value is correct.)
MF_tense_vbd        1   verb in past tense                                   OK
MF_3p_verbs         1   third word preceding the verb is a verb (broke)      OK
MF_3n_verbs         1   third word following the verb is a verb (talked)     OK
...
STA.LEX_prt_none    1   there is no particle dependent on the verb           OK
STA.LEX_prep_none   1   there is no preposition dependent on the verb        OK
...
MST.GEN_n_subj      1   nominal subject of the verb                          OK
...
SEM.s.ac            1   verb's subject is Abstract (here: he)                KO
...
tp                  1   true target pattern
ESSLLI 2015 Hladká & Holub Day 2, page 11/38
VPR Details on annotation Annotation by 1 expert and 3 annotators
verb     target classes   number of    baseline   avg human       perplexity   kappa
                          instances    (%)        accuracy (%)    2^H(P)
CRY      1,4,7,u,x        250          52.4       92.2            3.5          0.84
SUBMIT   1,2,4,5,u        250          70.8       94.1            2.6          0.88
baseline is the accuracy of the most-frequent-class classifier
avg human accuracy is the average accuracy of the 3 annotators with respect to the expert's annotation
perplexity is 2^H(P), where H(P) is the entropy of the target class distribution
kappa is Fleiss' kappa of inter-annotator agreement
ESSLLI 2015 Hladká & Holub Day 2, page 12/38
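The baseline and perplexity columns can be recomputed from the class frequencies on the basic-statistics slide. A small R sketch for CRY:
# Class frequencies of CRY from the basic statistics (patterns 1, 4, 7, u, x)
freq <- c(131, 59, 13, 33, 14)
p    <- freq / sum(freq)
max(p)                  # baseline: always predict the most frequent class -> 0.524
H <- -sum(p * log2(p))  # entropy of the target distribution
2^H                     # perplexity 2^H(P), approximately 3.5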
Questions? ESSLLI 2015 Hladká & Holub Day 2, page 13/38
Data analysis (cont'd) A deeper understanding of the task through a statistical view of the data We exploit the data in order to make predictions of the target value. Build intuition and understanding for both the task and the data Ask questions and search for answers in the data: What values do we see? What associations do we see? Do plotting and summarizing ESSLLI 2015 Hladká & Holub Day 2, page 14/38
Analyzing distributions of values Feature frequency The frequency of a feature is defined as $fr(A_j) = \#\{x_i : x_i^j > 0\}$, where $A_j$ is the $j$-th feature, $x_i$ is the feature vector of the $i$-th instance, and $x_i^j$ is the value of $A_j$ in $x_i$. ESSLLI 2015 Hladká & Holub Day 2, page 15/38
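The fr.feature function used on the next slide is not listed in the slides. A minimal sketch consistent with the definition above (counting the instances whose value of the feature is positive) could look like this:
## Sketch of a feature-frequency function: number of instances with a positive value
fr.feature <- function(a) {
  sum(as.numeric(a) > 0)
}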
Analyzing distributions of values Feature frequency
> examples <- read.csv("cry.development.csv", sep="\t")
> c <- examples[,-c(1,ncol(examples))]
> length(names(c))   # get the number of features
[1] 363
# compute feature frequencies using the fr.feature function
> ff <- apply(c, 2, fr.feature)
> table(sort(ff))
  0   1   2   3   4   5   6   7   8   9  10  12  14  15  16  20
181  47  26  12   9   3   5   6   4   4   7   1   3   1   2   1
 21  24  25  26  28  29  30  31  32  34  35  39  41  42  46  48  49
  3   1   1   2   1   1   3   5   2   2   1   1   1   1   1   3   1
 51  55  64  65  77  82  89  92  98 138 151 176 181 217 218 245
  1   1   1   1   1   1   1   1   2   1   1   1   1   1   1   1
247 248 249
  1   1   2
ESSLLI 2015 Hladká & Holub Day 2, page 16/38
Analyzing distributions of values Feature frequency [Plot] VPR task: cry (feature frequency cry.r): fr(A) for all features; x-axis: features, y-axis: fr(A) from 0 to about 250. ESSLLI 2015 Hladká & Holub Day 2, page 17/38
Analyzing distributions of values Feature frequency [Plot] VPR task: cry (feature frequency cry.r): the features with fr(A) > 50: STA.GEN_any_obj, MF.2p_verbs, MF.tense_vbg, MF.tense_vbd, MF.3n_nominal, MF.2p_nominal, MF.3p_nominal, MF.2n_nominal, MF.1p_nominal, MF.1n_adverbial, MST.GEN_n_subj, STA.GEN_n_subj, MST.LEX_prep_none, STA.LEX_prep_none, STA.LEX_prt_none, MST.LEX_prt_none, STA.LEX_mark_none, STA.GEN_c_subj, STA.LEX_prepc_none, MST.LEX_prepc_none, MST.LEX_mark_none. ESSLLI 2015 Hladká & Holub Day 2, page 18/38
Analyzing distributions of values Entropy
# compute entropy using the entropy function
> e <- apply(c, 2, entropy)
> table(sort(round(e, 2)))
   0 0.04 0.07 0.09 0.12 0.14 0.16 0.18  0.2 0.22 0.24 0.28 0.31 0.33
 181   49   27   13    9    4    5    6    4    4    7    1    3    1
0.34  0.4 0.42 0.46 0.47 0.48 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58
   2    1    3    1    1    2    1    1    3    5    3    1    2    1
0.62 0.64 0.65 0.69 0.71 0.73 0.76 0.82 0.83 0.85 0.88 0.89 0.91 0.94
   1    1    1    1    4    1    1    1    1    1    1    1    1    1
0.95 0.97 0.99
   1    3    1
ESSLLI 2015 Hladká & Holub Day 2, page 19/38
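The entropy function called above is likewise not listed in the slides. A minimal sketch, assuming it computes the empirical entropy (in bits) of a feature's value distribution, $H(A) = -\sum_a p(a) \log_2 p(a)$:
## Sketch of an entropy function: empirical entropy of a discrete feature, in bits
entropy <- function(a) {
  p <- table(a) / length(a)   # empirical probabilities of the feature values
  -sum(p * log2(p))           # H(A) = -sum_a p(a) * log2 p(a)
}
Note that a constant feature gets entropy 0, which matches the 181 zero-entropy features in the table above.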
Analyzing distributions of values Entropy [Plot] VPR task: cry (entropy cry.r): H(A) for all features; x-axis: features, y-axis: H(A) from 0.0 to 1.0. ESSLLI 2015 Hladká & Holub Day 2, page 20/38
Analyzing distributions of values Entropy [Plot] VPR task: cry (entropy cry.r): the features with H(A) > 0.5: MST.GEN_infinitive, MF.3p_verbs, STA.GEN_advcl, MF.1p_verbs, STA.GEN_dobj, MST.GEN_subj_pl, MF.3n_verbs, STA.GEN_infinitive, SEM.s.H, MST.GEN_any_obj, MF.tense_vb, STA.GEN_any_obj, SEM.s.ac, STA.LEX_prep_none, MF.2p_verbs, MST.GEN_advmod, STA.GEN_advmod, MF.tense_vbd, MST.LEX_prep_none, MF.tense_vbg, MF.1n_adverbial, MF.3n_nominal, MF.2p_nominal, STA.GEN_n_subj, MF.2n_nominal, MST.GEN_n_subj, MF.3p_nominal, MF.1p_nominal. ESSLLI 2015 Hladká & Holub Day 2, page 21/38
Association between feature and target value Pearson contingency coefficient [Plot] VPR task: cry (pearson contingency coefficient vpr.r): the features with pcc(A,P) > 0.3: MF.2n_nominal, MF.1p_verbs, MST.LEX_prep_to, SEM.s.act, MF.3p_adjective, MF.tense_vb, MF.3n_adjective, MST.LEX_prep_in, STA.LEX_prep_in, MF.2n_verbs, MST.GEN_dobj, MF.1n_nominal, MF.3n_nominal, SEM.s.domain, SEM.s.inst, MF.1p_be, MF.1p_wh.pronoun, STA.LEX_prep_to, STA.GEN_ccomp, MST.GEN_n_subj, MST.GEN_any_obj, STA.GEN_n_subj, STA.GEN_any_obj, MF.2p_verbs, SEM.s.sem, MF.tense_vbg, MF.2n_to, SEM.o.geom, SEM.s.L, MF.tense_vbd, MF.1n_adverbial, MST.LEX_prep_none, STA.LEX_prep_none, MST.LEX_prt_none, MST.LEX_prt_out, MF.2n_adverbial, STA.LEX_prt_none, STA.LEX_prt_out, MST.LEX_prep_for, STA.LEX_prep_for. ESSLLI 2015 Hladká & Holub Day 2, page 22/38
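The coefficient pcc(A, P) plotted above is not defined in the slides. The standard Pearson contingency coefficient is $C = \sqrt{\chi^2 / (\chi^2 + n)}$, where $\chi^2$ is the chi-squared statistic of the feature-target contingency table and $n$ is the number of instances. A hedged R sketch (the function name pcc and its exact form are our assumption, not necessarily the script used for the plot):
## Pearson's contingency coefficient between a feature and the target
pcc <- function(a, y) {
  if (length(unique(a)) < 2) return(0)   # constant feature: no measurable association
  tab  <- table(a, y)
  chi2 <- suppressWarnings(chisq.test(tab)$statistic)
  as.numeric(sqrt(chi2 / (chi2 + sum(tab))))
}
## e.g., coefficients of all features with respect to the target pattern:
## pc <- apply(c, 2, pcc, y = examples$tp)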
Association between feature and target value Conditional entropy ESSLLI 2015 Hladká & Holub Day 2, page 23/38
Association between feature and target value Conditional entropy
# compute conditional entropy using the entropy.cond function
> ce <- apply(c, 2, entropy.cond, y=examples$tp)
> table(sort(round(ce, 2)))
   0 0.04 0.07 0.09 0.12 0.14 0.16 0.18  0.2 0.22 0.24 0.28 0.31 0.33
 181   49   27   13    9    4    5    6    4    4    7    1    3    1
0.34  0.4 0.42 0.46 0.47 0.48 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58
   2    1    3    1    1    2    1    1    3    5    3    1    2    1
0.62 0.64 0.65 0.69 0.71 0.73 0.76 0.82 0.83 0.85 0.88 0.89 0.91 0.94
   1    1    1    1    4    1    1    1    1    1    1    1    1    1
0.95 0.97 0.99
   1    3    1
ESSLLI 2015 Hladká & Holub Day 2, page 24/38
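As with entropy, the entropy.cond function is not listed in the slides. A sketch of the conditional entropy $H(P \mid A) = \sum_a p(a)\, H(P \mid A = a)$, reusing the entropy() sketch from the entropy slide:
## Sketch of conditional entropy H(P|A) of the target given a discrete feature
entropy.cond <- function(a, y) {
  p.a <- table(a) / length(a)           # p(a)
  h   <- sapply(split(y, a), entropy)   # H(P | A = a) for each feature value a
  sum(p.a * h)
}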
Association between feature and target value Conditional entropy [Plot] VPR task: cry (entropy cry.r): H(P|A) for all features; x-axis: features, y-axis: H(P|A) from 0.0 to 0.8. ESSLLI 2015 Hladká & Holub Day 2, page 25/38
Association between feature and target value Conditional entropy [Plot] VPR task: cry (entropy cry.r): the features with H(P|A) > 0.5: MST.GEN_infinitive, MF.3p_verbs, STA.GEN_advcl, MF.1p_verbs, STA.GEN_dobj, MST.GEN_subj_pl, MF.3n_verbs, STA.GEN_infinitive, SEM.s.H, MST.GEN_any_obj, MF.tense_vb, STA.GEN_any_obj, SEM.s.ac, STA.LEX_prep_none, MF.2p_verbs, MST.GEN_advmod, STA.GEN_advmod, MF.tense_vbd, MST.LEX_prep_none, MF.tense_vbg, MF.1n_adverbial, MF.3n_nominal, MF.2p_nominal, STA.GEN_n_subj, MF.2n_nominal, MST.GEN_n_subj, MF.3p_nominal, MF.1p_nominal. ESSLLI 2015 Hladká & Holub Day 2, page 26/38
What values do we see Analyzing distributions of values Filter out ineffective features from the CRY data
> examples <- read.csv("cry.development.csv", sep="\t")
> n <- nrow(examples)
> ## remove id and target class tp
> c.0 <- examples[,-c(1,ncol(examples))]
> ## remove features with 0s only
> c.1 <- c.0[, colSums(as.matrix(sapply(c.0, as.numeric))) != 0]
> ## remove features with 1s only
> c.2 <- c.1[, colSums(as.matrix(sapply(c.1, as.numeric))) != n]
> ## remove column duplicates
> c <- data.frame(t(unique(t(as.matrix(c.2)))))
> ncol(c.0)   # get the number of input features
[1] 363
> ncol(c)     # get the number of effective features
[1] 168
ESSLLI 2015 Hladká & Holub Day 2, page 27/38
Methods for basic data exploration Confusion matrix Confusion matrices are contingency tables that display the results of classification algorithms or annotations. They enable error/difference analysis. Example Two annotators A1 and A2 annotated 50 sentences with cry.
         A2
A1        1   4   7   u   x
 1       24   3   1   3   0
 4        3   3   0   1   1
 7        0   2   4   0   1
 u        1   0   0   0   0
 x        0   1   0   0   2
ESSLLI 2015 Hladká & Holub Day 2, page 28/38
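Such a matrix can be built in R with table(). A minimal sketch with toy annotation vectors (the labels below are hypothetical and just illustrate the mechanics, not the actual 50 annotations):
# Toy annotations of 5 sentences by two annotators, using the pattern inventory above
a1 <- factor(c("1", "1", "4", "7", "u"), levels = c("1", "4", "7", "u", "x"))
a2 <- factor(c("1", "4", "4", "7", "1"), levels = c("1", "4", "7", "u", "x"))
# Cross-tabulate the two annotations: rows = annotator A1, columns = annotator A2
table(A1 = a1, A2 = a2)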
What agreement would be reached by chance? Example 1 Assume two annotators (A1, A2), two classes (t1, t2), and the following distribution:
        t1     t2
A1      50 %   50 %
A2      50 %   50 %
Then the best possible agreement is 100 %, the worst possible agreement is 0 %, and the agreement-by-chance would be 0.5 * 0.5 + 0.5 * 0.5 = 50 %.
ESSLLI 2015 Hladká & Holub Day 2, page 29/38
What agreement would be reached by chance? Example 2 Assume two annotators (A1, A2), two classes (t1, t2), and the following distribution:
        t1     t2
A1      90 %   10 %
A2      90 %   10 %
Then the best possible agreement is 100 %, the worst possible agreement is 80 %, and the agreement-by-chance would be 0.9 * 0.9 + 0.1 * 0.1 = 82 %.
ESSLLI 2015 Hladká & Holub Day 2, page 30/38
What agreement would be reached by chance? Example 3 Assume two annotators (A1, A2), two classes (t1, t2), and the following distribution:
        t1     t2
A1      90 %   10 %
A2      80 %   20 %
Then the best possible agreement is 90 %, the worst possible agreement is 70 %, and the agreement-by-chance would be 0.9 * 0.8 + 0.1 * 0.2 = 74 %.
ESSLLI 2015 Hladká & Holub Day 2, page 31/38
Example in R The situation from Example 3 can be simulated in R
# N will be the sample size
> N = 10^6
# two annotators will annotate randomly
> A1 = sample(c(rep(1, 0.9*N), rep(0, 0.1*N)))
> A2 = sample(c(rep(1, 0.8*N), rep(0, 0.2*N)))
# percentage of their observed agreement
> mean(A1 == A2)
[1] 0.740112
# exact calculation -- just for comparison
> 0.9*0.8 + 0.1*0.2
[1] 0.74
ESSLLI 2015 Hladká & Holub Day 2, page 32/38
Cohen's kappa Cohen's kappa was introduced by Jacob Cohen in 1960.
$\kappa = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)}$
Pr(a) is the relative observed agreement among annotators = percentage of agreements in the sample
Pr(e) is the hypothetical probability of chance agreement = probability of their agreement if they annotated randomly
κ > 0 if the proportion of agreement obtained exceeds the proportion of agreement expected by chance
Limitations Cohen's kappa measures agreement between two annotators only; for more annotators you should use Fleiss' kappa, see http://en.wikipedia.org/wiki/Fleiss%27_kappa
ESSLLI 2015 Hladká & Holub Day 2, page 33/38
Cohen's kappa
         A2
A1        1   4   7   u   x
 1       24   3   1   3   0
 4        3   3   0   1   1
 7        0   2   4   0   1
 u        1   0   0   0   0
 x        0   1   0   0   2
Cohen's kappa: ?
ESSLLI 2015 Hladká & Holub Day 2, page 34/38
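A hedged R sketch that computes Cohen's kappa directly from the confusion matrix above, following the formula from the previous slide (the matrix is entered row by row; rows = A1, columns = A2):
# The confusion matrix of the two annotators over the 50 sentences
m <- matrix(c(24, 3, 1, 3, 0,
               3, 3, 0, 1, 1,
               0, 2, 4, 0, 1,
               1, 0, 0, 0, 0,
               0, 1, 0, 0, 2),
            nrow = 5, byrow = TRUE,
            dimnames = list(A1 = c("1", "4", "7", "u", "x"),
                            A2 = c("1", "4", "7", "u", "x")))
n    <- sum(m)
pr.a <- sum(diag(m)) / n                     # observed agreement Pr(a)
pr.e <- sum(rowSums(m) * colSums(m)) / n^2   # chance agreement Pr(e) from the marginals
(pr.a - pr.e) / (1 - pr.e)                   # Cohen's kappa, roughly 0.44 for this matrix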
Homework 2.1 Work with the SUBMIT data 1 Filter out ineffective features from the data using the filtering rules that we applied to the CRY data. 2 Draw a plot of the conditional entropy H(P|A) for the effective features. Then focus on the features for which H(P|A) ≥ 0.5. Comment on what you see in the plots. ESSLLI 2015 Hladká & Holub Day 2, page 35/38
VPR vs. MOV comparison
                        MOV           VPR
type of task            regression    classification
getting examples by     collecting    annotation
# of examples           100,000       250
# of features           32            363
  categorical/binary    29/18         0/363
  numerical             3             0
output values           1-5           5 discrete categories
ESSLLI 2015 Hladká & Holub Day 2, page 36/38
Block 2.2 Introductory remarks on VPR classifiers ESSLLI 2015 Hladká & Holub Day 2, page 37/38
Example Decision Tree classifier for cry [Figure: the fitted decision tree], trained using a cross-validation fold ESSLLI 2015 Hladká & Holub Day 2, page 38/38
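A hedged R sketch of how such a classifier could be trained on the cry data. The slides do not say which implementation was used; rpart is our choice here, and for simplicity the sketch trains on the whole development file rather than on a single cross-validation fold:
library(rpart)
# Load the cry development data and drop the instance id column
examples    <- read.csv("cry.development.csv", sep = "\t")
examples$tp <- as.factor(examples$tp)
train       <- examples[, -1]
# Grow a classification tree that predicts the target pattern from all features
tree <- rpart(tp ~ ., data = train, method = "class")
print(tree)     # the fitted tree
printcp(tree)   # complexity table with cross-validated error estimates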