Distributional Similarity

1 Overview Alpage (INRIA & Paris 7) Séminaire CENTAL, Friday 5 November 2010

2 Outline Similarity · Context · Dimensionality reduction (latent semantic analysis, non-negative matrix factorization) · Tensors

3 Semantic similarity Most work on semantic similarity relies on the distributional hypothesis (Harris 1954). Take a word and its contexts: tasty tnassiorc, greasy tnassiorc, tnassiorc with butter, tnassiorc for breakfast. By looking at a word's contexts, one can infer its meaning.

4 Semantic similarity Most work on semantic similarity relies on the distributional hypothesis (Harris 1954). Take a word and its contexts: tasty tnassiorc, greasy tnassiorc, tnassiorc with butter, tnassiorc for breakfast → FOOD. By looking at a word's contexts, one can infer its meaning.

6 Matrix Similarity Context Dimensionality reduction Tensors Capture co-occurrence frequencies of two entities

7 Matrix Capture co-occurrence frequencies of two entities. Context adjectives: rouge red, délicieux delicious, rapide fast, d occasion second-hand; target nouns: pomme apple, vin wine, voiture car, camion truck.

11 Similarity calculation Cosine: cos(x, y) = x · y / (‖x‖ ‖y‖) = Σ_{i=1}^n x_i y_i / (√(Σ_{i=1}^n x_i²) √(Σ_{i=1}^n y_i²)). Examples: cos(pomme, vin) = .96, cos(pomme, voiture) = .42. Other possibilities: set-theoretic measures (Dice, Jaccard), probabilistic measures (Kullback-Leibler divergence, Jensen-Shannon divergence).
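
As a sketch, the cosine measure above can be computed over toy co-occurrence vectors. The counts below are invented for illustration and do not reproduce the .96/.42 scores from the slide:

```python
import numpy as np

# Toy co-occurrence counts over four adjective contexts
# (rouge, delicieux, rapide, d'occasion); numbers are invented.
vectors = {
    "pomme":   np.array([4.0, 5.0, 0.0, 0.0]),
    "vin":     np.array([5.0, 4.0, 0.0, 1.0]),
    "voiture": np.array([2.0, 0.0, 5.0, 4.0]),
}

def cosine(x, y):
    # cos(x, y) = x . y / (||x|| ||y||)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

print(cosine(vectors["pomme"], vectors["vin"]))      # high: similar contexts
print(cosine(vectors["pomme"], vectors["voiture"]))  # low: different contexts
```

Words that share contexts (pomme, vin) end up with a high cosine; words with disjoint contexts score low.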

12 Different kinds of context Three different word space models based on context: document-based model (nouns × documents), window-based model (nouns × context words), syntax-based model (nouns × dependency relations). Each model comes with a plethora of parameters: document size, window size, type of dependency relations, weighting function, ± dimensionality reduction.

13 Different kinds of semantic similarity Tight, synonym-like similarity: (near-)synonymous or (co-)hyponymous. Loosely related, topical similarity: looser relationships, such as association and meronymy.

14 Different kinds of semantic similarity Tight, synonym-like similarity: (near-)synonymous or (co-)hyponymous. Loosely related, topical similarity: looser relationships, such as association and meronymy. Example: médecin doctor (tight): docteur doctor, médecin de famille family doctor, chirurgien surgeon, spécialiste specialist, dermatologue dermatologist, gynécologue gynaecologist. médecin doctor (topical): malade patient, maladie disease, diagnostic diagnosis, traitement treatment, hôpital hospital, stéthoscope stethoscope.

15 Relation context similarity Different contexts lead to different kinds of similarity: syntax and small windows vs. large windows and documents. The former models induce tight, synonymous similarity; the latter models induce topical relatedness.

16 Relation context similarity Different contexts lead to different kinds of similarity: syntax and small windows vs. large windows and documents. The former models induce tight, synonymous similarity; the latter models induce topical relatedness. Evaluation: the syntax-based model scores best when evaluated according to Wordnet similarity measures (cornetto); large-window and document-based models do not score well on Wordnet similarity, but do score well on Wordnet domain evaluation.

17 Dimensionality reduction Two reasons for performing dimensionality reduction. Intractable computations: when the number of elements and the number of features is too large, similarity computations may become intractable; reducing the number of features makes computation tractable again. Generalization capacity: the dimensionality reduction may describe the data better, capturing intrinsic semantic features; dimensionality reduction is able to improve the results (countering data sparseness and noise).

18 Latent semantic analysis: introduction Application of a mathematical/statistical technique to simulate how humans learn the semantics of words. LSA finds latent semantic dimensions according to which words and documents can be identified. Goal: counter data sparseness and get rid of noise.

19 Latent semantic analysis: introduction What is latent semantic analysis, technically speaking? The application of singular value decomposition to a term-document matrix to improve similarity calculations.

20 Singular value decomposition Find the dimensions that explain most variance by solving a number of eigenvector problems. Only keep the n most important dimensions (n = ).
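
A minimal sketch of the truncated decomposition with NumPy; the toy term-document matrix and k = 5 below are invented (the talk itself uses far larger matrices and 300 dimensions):

```python
import numpy as np

# Truncated SVD of a toy term-document count matrix A (terms x docs).
rng = np.random.default_rng(0)
A = rng.poisson(1.0, size=(20, 50)).astype(float)

k = 5                                     # number of dimensions kept
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]      # best rank-k approximation of A

# Terms are then compared in the k-dimensional latent space.
term_vectors = U[:, :k] * s[:k]
print(A_k.shape, term_vectors.shape)
```

Keeping only the k largest singular values gives the closest rank-k matrix in the least-squares sense, which is exactly the fit LSA relies on.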

21 Similarity Context Dimensionality reduction Tensors SVD: three matrices

22 LSA: Example LSA applied to (part of) the Twente Nieuws Corpus: 10 years of Dutch newspaper texts (AD, NRC, TR, VK, PAR); terms = nouns, documents = paragraphs; a 20,000 terms × 2,000,000 documents matrix, reduced to 300 dimensions.

23 LSA: criticism LSA has been criticized for a number of reasons: the dimensionality reduction is a best fit in the least-squares sense, which is only valid for normally distributed data, and language data is not normally distributed; dimensions may contain negative values, and it is not clear what negativity on a semantic scale should designate. These shortcomings are remedied by subsequent techniques (PLSA, LDA, NMF, ...).

24 Non-negative matrix factorization: technique Given a non-negative matrix V, find non-negative matrix factors W and H such that V_{n×m} ≈ W_{n×r} H_{r×m} (1). Choosing r ≪ n, m reduces the data. Constraint on the factorization: all values in the three matrices need to be non-negative (≥ 0). The constraint brings about a parts-based representation: only additive, no subtractive, relations are allowed.

25 Non-negative matrix factorization: technique Different kinds of NMF minimize different cost functions: the square of the Euclidean distance, or the Kullback-Leibler divergence (better suited for language phenomena). To find the NMF is to minimize D(V ‖ WH) with respect to W and H, subject to the constraints W, H ≥ 0. This can be done with the update rules H_{aμ} ← H_{aμ} (Σ_i W_{ia} V_{iμ}/(WH)_{iμ}) / (Σ_k W_{ka}) and W_{ia} ← W_{ia} (Σ_μ H_{aμ} V_{iμ}/(WH)_{iμ}) / (Σ_v H_{av}) (2). These update rules converge to a local minimum in the minimization of the KL divergence.
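
The multiplicative updates can be sketched in a few lines of NumPy; all sizes and data below are toy values, and kl is the generalized KL divergence D(V ‖ WH):

```python
import numpy as np

# Multiplicative updates (Lee & Seung) minimizing the generalized
# KL divergence D(V || WH); sizes and data are toy values.
rng = np.random.default_rng(0)
n, m, r = 8, 12, 3
V = rng.random((n, m)) + 0.1      # non-negative data matrix
W = rng.random((n, r)) + 0.1      # non-negative initial factors
H = rng.random((r, m)) + 0.1

def kl(V, WH):
    # generalized KL divergence D(V || WH)
    return float(np.sum(V * np.log(V / WH) - V + WH))

before = kl(V, W @ H)
for _ in range(200):
    WH = W @ H
    H *= (W.T @ (V / WH)) / W.sum(axis=0)[:, None]   # update rule for H
    WH = W @ H
    W *= ((V / WH) @ H.T) / H.sum(axis=1)[None, :]   # update rule for W
after = kl(V, W @ H)
print(before, "->", after)   # divergence decreases to a local minimum
```

Because the factors are only ever multiplied by non-negative quantities, W and H stay non-negative throughout, which is what yields the parts-based representation.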

26 NMF: graphical representation Similarity Context Dimensionality reduction Tensors

27 NMF: results Context vectors (5k nouns × 2k co-occurring nouns) extracted from the clef corpus; nmf is able to capture clear semantic dimensions. Examples: bus bus, taxi taxi, trein train, halte stop, reiziger traveler, perron platform, tram tram, station station, chauffeur driver, passagier passenger; bouillon broth, slagroom cream, ui onion, eierdooier egg yolk, laurierblad bay leaf, zout salt, deciliter decilitre, boter butter, bleekselderij celery, saus sauce.

28 Two-way vs. three-way All of these methods use two-way co-occurrence frequencies. A matrix is suitable for two-way problems (words × documents, nouns × dependency relations), but not for n-way problems (words × documents × authors, verbs × subjects × direct objects); for those, a tensor is needed.

30 Non-negative tensor factorization: technique The idea is similar to non-negative matrix factorization, but the calculations are different: min_{x_i ∈ ℝ^{D₁}_{≥0}, y_i ∈ ℝ^{D₂}_{≥0}, z_i ∈ ℝ^{D₃}_{≥0}} ‖T − Σ_{i=1}^k x_i ⊗ y_i ⊗ z_i‖²_F
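
The objective approximates the tensor T by a sum of k outer products of non-negative vectors. A sketch of that reconstruction with NumPy (the rank and mode sizes are invented):

```python
import numpy as np

# Rank-k model sum_i x_i (outer) y_i (outer) z_i that non-negative
# tensor factorization fits to a tensor T; sizes/values are toy.
rng = np.random.default_rng(0)
k, d1, d2, d3 = 2, 4, 5, 6
X = rng.random((d1, k))   # non-negative factor matrices,
Y = rng.random((d2, k))   # one column per latent dimension
Z = rng.random((d3, k))

# Sum of k three-way outer products, written as one einsum.
T_hat = np.einsum('ik,jk,lk->ijl', X, Y, Z)
print(T_hat.shape)
```

The NTF algorithm itself then alternates non-negative least-squares updates over X, Y and Z to minimize the Frobenius norm of T − T_hat.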

31 NTF: graphical representation Similarity Context Dimensionality reduction Tensors

32 Task: automatic extraction of multi-word expressions (mwes) from large corpora. Starting point: many mwes are non-compositional, i.e. the meaning of the mwe is not the sum of the meanings of the individual words. Intuition: a noun within an mwe cannot easily be replaced by a semantically similar noun. Use semantic clusters to determine whether an mwe candidate is compositional or not.

33 Intuition break the ice / *break the snow / *break the hail vs. break the vase / break the cup / break the dish. In the first expression, it is not possible to replace ice with semantically related nouns such as snow or hail; in the second expression, it is possible to replace vase with semantically related words such as cup or dish.

34 Overview of the method Verb + prepositional complement instances are extracted from a 500M word corpus (focus on mwes with a PP); a matrix of 5K verb-preposition combinations × 10K nouns is created; the 10K nouns are automatically clustered using distributional similarity measures; a number of statistical measures is applied to determine unique associations, given the cluster in which a noun appears.

35 Measures Inspired by selectional preferences (Resnik 1993), entropy-based: the Kullback-Leibler divergence between P(n) and P(n|v): S_v = Σ_n p(n|v) log (p(n|v) / p(n)) (3)

36 Measures The preference of the verb for the noun, ∈ [0, 1]: A_{v→n} = p(n|v) log (p(n|v) / p(n)) / S_v (4). Ratio of the verb's preference for a particular noun, compared to the other nouns in the cluster, ∈ [0, 1]: R_{v→n} = A_{v→n} / Σ_{n′∈C} A_{v→n′} (5)

37 Measures The preference of the noun for the verb, ∈ [0, 1]: A_{n→v} = p(v|n) log (p(v|n) / p(v)) / S_n (6). Ratio of the noun's preference for a particular verb, compared to the other nouns in the cluster, ∈ [0, 1]: R_{n→v} = A_{n→v} / Σ_{n′∈C} A_{n′→v} (7)
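
A hedged sketch of measures (3) and (4) over invented verb-noun counts; the verbs, nouns and counts below are all hypothetical (note also that A_{v→n} can fall below 0 for nouns with p(n|v) < p(n), so the [0, 1] range only holds for preferred nouns):

```python
import math
from collections import Counter

# Invented verb-noun co-occurrence counts; purely illustrative.
pairs = Counter({("drink", "beer"): 50, ("drink", "water"): 40,
                 ("drink", "idea"): 1,  ("have", "beer"): 5,
                 ("have", "water"): 5,  ("have", "idea"): 40})

total = sum(pairs.values())
verb_tot = Counter()
noun_tot = Counter()
for (v, n), c in pairs.items():
    verb_tot[v] += c
    noun_tot[n] += c

def p_n(n):             # p(n)
    return noun_tot[n] / total

def p_n_given_v(n, v):  # p(n|v)
    return pairs[(v, n)] / verb_tot[v]

def S(v):
    # (3): KL divergence between P(n|v) and P(n)
    return sum(p_n_given_v(n, v) * math.log(p_n_given_v(n, v) / p_n(n))
               for (vv, n) in pairs if vv == v)

def A(v, n):
    # (4): the verb's preference for the noun, normalized by S_v
    return p_n_given_v(n, v) * math.log(p_n_given_v(n, v) / p_n(n)) / S(v)

print(S("drink"), A("drink", "beer"))
```

With these toy counts, "drink" strongly prefers beverage nouns, so A("drink", "beer") is large while A("drink", "idea") is negative.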

38 An elaborated example in de smaak vallen (lit. 'fall in the taste', i.e. to be appreciated): *geur smell, *voorkeur preference, *stijl style vs. in de put vallen (lit. 'fall in the well', i.e. to fall down the well): kuil hole, krater crater, greppel trench.

39 An elaborated example smaak: idioom, karakter, persoonlijkheid, stijl, temperament, thematiek, uiterlijk, uitstraling, voorkomen mwe candidate (1) (2) (3) (4) mwe? val#in smaak yes val#in karakter no val#in stijl no

40 An elaborated example put: gaatje, gat, kloof, krater, kuil, lek, scheur, valkuil mwe candidate (1) (2) (3) (4) mwe? val#in put no val#in kuil no val#in kloof no

41 Quantitative Evaluation Fully automated, compared to the Referentie Bestand Nederlands (RBN). The upper bound consists of all RBN mwes present in the data. Parameters Prec Rec F-Measure (1) (2) (3) (4) N (%) (%) (%) Fazly/Stevenson Random baseline

42 Conclusion The non-compositionality based algorithm is able to rule out expressions that are wrongly coined mwes by traditional algorithms, improving on the state of the art. Using measures (1) and (2) gives the best results; using (3) and (4) increases precision but degrades recall.

43 Ambiguity Problem: ambiguity, e.g. bar

49 Ambiguity Problem: ambiguity, e.g. bar. Main research question: can topical similarity and tight, synonym-like similarity be combined to differentiate between the various senses of a word?

50 Goal: classification of nouns according to both window-based context (with a large window) and syntactic context. Construct three matrices capturing co-occurrence frequencies for each mode: nouns cross-classified by dependency relations, nouns cross-classified by (bag-of-words) context words, and dependency relations cross-classified by context words. Apply nmf to the matrices, but interleave the process: the result of one factorization is used to initialize the factorization of the next one.

51 Graphical Representation A (5k nouns × 80k dependency relations) ≈ W H, B (5k nouns × 2k context words) ≈ V G, C (2k context words × 80k dependency relations) ≈ U F, with U of size 2k × 50 and F of size 50 × 80k.

52 Sense subtraction Switch off dimension(s) of an ambiguous word to reveal its other possible senses. From matrix W, we know which dimensions are the most important for a certain word; matrix H gives the importance of each dependency relation given a dimension. Subtract the dependency relations that are responsible for a given dimension from the original noun vector: v_new = v_orig ⊙ (1 − h_dim), i.e. each dependency relation is multiplied by a scaling factor, according to the load of the feature on the subtracted dimension.
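
A sketch of the subtraction step with toy values; the factor matrix H, the noun vector and the chosen dimension below are invented, and the loads are scaled to [0, 1] before subtraction as one plausible reading of the scaling factor:

```python
import numpy as np

# Toy nmf factor H (dimensions x dependency relations) and an
# original noun vector over the same dependency relations.
rng = np.random.default_rng(1)
k, n_feats = 4, 10
H = rng.random((k, n_feats))
v_orig = rng.random(n_feats)

dim = 2                        # dominant dimension of the noun (from W)
h = H[dim] / H[dim].max()      # feature loads scaled to [0, 1]
v_new = v_orig * (1.0 - h)     # v_new = v_orig (elementwise) (1 - h_dim)

print(v_new.round(2))
```

Features that load heavily on the subtracted dimension are damped toward zero, so the remaining vector reflects the word's other senses.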

53 Combination with clustering A simple clustering algorithm (k-means) assigns each ambiguous noun to its predominant sense. The centroid of the cluster is folded into the nmf model; the dimensions that define the centroid are subtracted from the ambiguous noun vector; the adapted noun vector is fed to the clustering algorithm again.

54 Experimental Design Approach applied to Dutch, using the Twente Nieuws Corpus (± 500M words), parsed with the Dutch dependency parser alpino. Three matrices constructed: 5k nouns × 80k dependency relations, 5k nouns × 2k context words, 80k dependency relations × 2k context words. Factorization to 50 dimensions.

55 Example dimension: transport 1 nouns: auto car, wagen car, tram tram, motor motorbike, bus bus, metro subway, automobilist driver, trein train, stuur steering wheel, chauffeur driver 2 context words: auto car, trein train, motor motorbike, bus bus, rij drive, chauffeur driver, fiets bike, reiziger traveler, passagier passenger, vervoer transport 3 dependency relations: viertraps adj four-stage, verplaats met obj move with, toeter adj honk, tank in houd obj [parsing error], tank subj refuel, tank obj refuel, rij voorbij subj pass by, rij voorbij adj pass by, rij af subj drive off, peperduur adj very expensive

56 Pop: most similar words pop music doll 1 pop, rock, jazz, meubilair furniture, popmuziek pop music, heks witch, speelgoed toy, kast cupboard, servies [tea] service, vraagteken question mark 2 pop, meubilair furniture, speelgoed toy, kast cupboard, servies [tea] service, heks witch, vraagteken question mark, sieraad jewel, sculptuur sculpture, schoen shoe 3 pop, rock, jazz, popmuziek pop music, heks witch, danseres dancer, servies [tea] service, kopje cup, house house music, aap monkey

57 Barcelona: most similar words Spanish city Spanish football club 1 Barcelona, Arsenal, Inter, Juventus, Vitesse, Milaan Milan, Madrid, Parijs Paris, Wenen Vienna, München Munich 2 Barcelona, Milaan Milan, München Munich, Wenen Vienna, Madrid, Parijs Paris, Bonn, Praag Prague, Berlijn Berlin, Londen London 3 Barcelona, Arsenal, Inter, Juventus, Vitesse, Parma, Anderlecht, PSV, Feyenoord, Ajax

58 Clustering example: werk 1 werk work, beeld image, foto photo, schilderij painting, tekening drawing, doek canvas, installatie installation, afbeelding picture, sculptuur sculpture, prent picture, illustratie illustration, handschrift manuscript, grafiek print, aquarel aquarelle, maquette scale-model, collage collage, ets etching 2 werk work, boek book, titel title, roman novel, boekje booklet, debuut debut, biografie biography, bundel collection, toneelstuk play, bestseller bestseller, kinderboek children's book, autobiografie autobiography, novelle short story 3 werk work, voorziening service, arbeid labour, opvoeding education, kinderopvang child care, scholing education, huisvesting housing, faciliteit facility, accommodatie accommodation, arbeidsomstandigheid working condition

59 Evaluation: methodology Comparison to EuroWordNet senses using Wu & Palmer's Wordnet similarity measure. A sense is assigned to a correct cluster if: for the top 4 words of the cluster, the average similarity between the sense and the top 4 words exceeds a similarity threshold in EuroWordNet, and the sense has not yet been considered correct; when multiple senses exist in EuroWordNet, the one with maximum similarity is chosen.

60 Evaluation: precision & recall Precision of a word: percentage of correct clusters to which it is assigned; overall: average precision of all words in the test set. Recall of a word: percentage of senses in EuroWordNet that have a corresponding cluster; overall: average recall of all words in the test set. Test set: words covered by the algorithm that are also present in EuroWordNet (3683 words).

61 Evaluation: results threshold θ.40 (%).50 (%).60 (%) kmeans nmf prec rec cbc prec rec kmeans orig prec rec

62 Evaluation: results kmeans nmf beats cbc with regard to precision cbc beats kmeans nmf with regard to recall kmeans nmf has higher recall than kmeans orig, so algorithm is able to find multiple senses with good precision

63 Conclusion Combining bag-of-words data and syntactic data is useful: bag-of-words data (factorized with nmf) puts its finger on topical dimensions, while syntactic data is particularly good at finding similar words. A clustering approach allows one to determine which topical dimension(s) are responsible for a certain sense, and to adapt the (syntactic) feature vector of the noun accordingly, subtracting the more dominant sense to discover less dominant senses. The algorithm scores better with regard to precision, lower with regard to recall.

64 Standard selectional preference models use two-way co-occurrences, keeping track of single relationships. But two-way selectional preference models are not sufficiently rich. Compare: The skyscraper is playing coffee. The turntable is playing the piano.

65 The skyscraper is playing coffee. (play, su, scyscraper) (play, obj, coffee) The turntable is playing the piano. (play, su, turntable) (play, obj, piano) (play, su, turntable, obj, piano)

66 Three-way extraction of selectional preferences Approach applied to Dutch, using the twente nieuws corpus (500M words of newspaper texts), parsed with the Dutch dependency parser alpino. Three-way co-occurrences of verbs with subjects and direct objects are extracted, and adapted with an extension of pointwise mutual information. The resulting tensor (1k verbs × 10k subjects × 10k direct objects) is factorized with non-negative tensor factorization in k dimensions (k = 50, 100, 300).

67 Graphical representation

68 Example dimension: police action subjects su s verbs v s objects obj s politie police.99 houd aan arrest.64 verdachte suspect.16 agent policeman.07 arresteer arrest.63 man man.16 autoriteit authority.05 pak op run in.41 betoger demonstrator.14 Justitie Justice.05 schiet dood shoot.08 relschopper rioter.13 recherche detective force.04 verdenk suspect.07 raddraaier instigator.13 marechaussee military police.04 tref aan find.06 overvaller raider.13 justitie justice.04 achterhaal overtake.05 Roemeen Romanian.13 arrestatieteam special squad.03 verwijder remove.05 actievoerder campaigner.13 leger army.03 zoek search.04 hooligan hooligan.13 douane customs.02 spoor op track.03 Algerijn Algerian.13

69 Example dimension: legislation subjects su s verbs v s objects obj s meerderheid majority.33 steun support.83 motie motion.63 VVD.28 dien in submit.44 voorstel proposal.53 D66.25 neem aan pass.23 plan plan.28 Kamermeerderheid.25 wijs af reject.17 wetsvoorstel bill.19 fractie party.24 verwerp reject.14 hem him.18 PvdA.23 vind think.08 kabinet cabinet.16 CDA.23 aanvaard accepts.05 minister minister.16 Tweede Kamer.21 behandel treat.05 beleid policy.13 partij party.20 doe do.04 kandidatuur candidature.11 Kamer Chamber.20 keur goed pass.03 amendement amendment.09

70 Example dimension: exhibition subjects su s verbs v s objects obj s tentoonstelling exhibition.50 toon display.72 schilderij painting.47 expositie exposition.49 omvat cover.63 werk work.46 galerie gallery.36 bevat contain.18 tekening drawing.36 collectie collection.29 presenteer present.17 foto picture.33 museum museum.27 laat let.07 sculptuur sculpture.25 oeuvre oeuvre.22 koop buy.07 aquarel aquarelle.20 Kunsthal.19 bezit own.06 object object.19 kunstenaar artist.15 zie see.05 beeld statue.12 dat that.12 koop aan acquire.05 overzicht overview.12 hij he.10 in huis heb own.04 portret portrait.11

71 Quality count 44 dimensions contain similar, frame-like semantics; 43 dimensions contain less clear-cut semantics (single verbs account for one dimension, verb senses are mixed up); 13 dimensions are based on syntax rather than semantics (fixed expressions, pronouns).

72 Evaluation: methodology A pseudo-disambiguation task to test generalization capacity (the standard automatic evaluation for selectional preferences): each attested triple (s, v, o) is paired with an alternative (s′, o′), e.g. jongere drink bier 'youngster drink beer' vs. coalitie, aandeel 'coalition, share'; werkgever riskeer boete 'employer risk fine' vs. doel, kopzorg 'goal, worry'; directeur zwaai scepter 'manager sway sceptre' vs. informateur, wodka 'informer, vodka'. 10-fold cross-validation (± 300,000 co-occurrences).

73 Evaluation: models Evaluation of 4 different models. 2 matrix models, 1k verbs × (10k subjects + 10k direct objects): singular value decomposition (ℝ) and non-negative matrix factorization (ℝ≥0). 2 tensor models, 1k verbs × 10k subjects × 10k direct objects: parallel factor analysis (ℝ) and non-negative tensor factorization (ℝ≥0).

74 Evaluation: results dimensions 50 (%) 100 (%) 300 (%) svd ± ± ± 1.01 nmf ± ± ± 0.63 parafac ± ± ± 0.76 ntf ± ± ± 0.16

75 Conclusion A novel method able to investigate three-way co-occurrences, capable of automatically inducing selectional preferences. Three-way methods improve on two-way methods; the non-negativity constraint improves on unconstrained models; non-negative tensor factorization outperforms the other models.

76 Similarity evaluation Wordnet-based similarity Compare results to the Dutch cornetto database. Two similarity measures: path length (the Wu & Palmer similarity measure) and information-theoretic (Lin's similarity measure). Nouns close in the hierarchy are tightly similar. Pairwise similarity for the k most similar words; test set of ± 5000 nouns.

77 Similarity evaluation Wordnet-based similarity wu & palmer s similarity lin s similarity model k = 1 k = 3 k = 5 k = 1 k = 3 k = 5 document window (w=par) window (w=2) syntax baseline syntax > window (w=2) window (w=par) > document

78 Similarity evaluation Wordnet-based similarity Syntax (with pmi) performs best, closely followed by the small window (with pmi). Large window and document perform much worse; dimensionality reduction only helps to improve the document-based model (a little).

79 Similarity evaluation Cluster quality Compare the output of the clustering algorithm to a gold-standard classification. Two clustering tasks (the esslli 2008 workshop's shared task): concrete noun categorization (44 nouns; 2-way: natural, artefact; 3-way: animal, vegetable, artefact; 6-way: bird, groundanimal, fruittree, green, tool, vehicle) and abstract/concrete noun discrimination (30 nouns; 2-way: hi, lo). Evaluation measures: entropy, the distribution of classes within a cluster (small = good), and purity, the ratio of the largest class present in a cluster (large = good).

80 Similarity evaluation Cluster quality 2-way 3-way 6-way model ent pur ent pur ent pur document window (w=par) window (w=2) syntax-based syntax > window (w=2) document > window (w=par)

81 Similarity evaluation Cluster quality Same tendencies as wordnet-based similarity model with large window seems to extract topically related clusters: aardappel potato, ananas pineapple, banaan banana, champignon mushroom, fles bottle, kers cherry, ketel kettle, kip chicken, kom bowl, lepel spoon, peer pear, sla lettuce, ui onion Similar result for abstract/concrete noun discrimination

82 Similarity evaluation Domain coherence Coherence of semantic domain tags (available in cornetto) particular areas of human knowledge (politics, medicine, sports) topical similarity Ratio of most frequent domain tag (also in tagset of target word) over top 10 similar words Same test set of ± 5000 nouns

83 Similarity evaluation Domain coherence model sim topic: document .394, window (w=par) .399, window (w=2) .414, syntax .441, baseline .048. syntax > window (w=2) = window (w=par) = document

84 Similarity evaluation Domain coherence Syntax still scores best Other models do not perform much worse No real difference between small window and large window large window and document do not extract tight similarity, but they do grasp topical similarity


Machine Learning for Data Science (CS4786) Lecture 1 Machine Learning for Data Science (CS4786) Lecture 1 Tu-Th 10:10 to 11:25 AM Hollister B14 Instructors : Lillian Lee and Karthik Sridharan ROUGH DETAILS ABOUT THE COURSE Diagnostic assignment 0 is out:

More information

CS Master Level Courses and Areas COURSE DESCRIPTIONS. CSCI 521 Real-Time Systems. CSCI 522 High Performance Computing

CS Master Level Courses and Areas COURSE DESCRIPTIONS. CSCI 521 Real-Time Systems. CSCI 522 High Performance Computing CS Master Level Courses and Areas The graduate courses offered may change over time, in response to new developments in computer science and the interests of faculty and students; the list of graduate

More information

Data Mining Yelp Data - Predicting rating stars from review text

Data Mining Yelp Data - Predicting rating stars from review text Data Mining Yelp Data - Predicting rating stars from review text Rakesh Chada Stony Brook University [email protected] Chetan Naik Stony Brook University [email protected] ABSTRACT The majority

More information

BUILDING A PREDICTIVE MODEL AN EXAMPLE OF A PRODUCT RECOMMENDATION ENGINE

BUILDING A PREDICTIVE MODEL AN EXAMPLE OF A PRODUCT RECOMMENDATION ENGINE BUILDING A PREDICTIVE MODEL AN EXAMPLE OF A PRODUCT RECOMMENDATION ENGINE Alex Lin Senior Architect Intelligent Mining [email protected] Outline Predictive modeling methodology k-nearest Neighbor

More information

TF-IDF. David Kauchak cs160 Fall 2009 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture6-tfidf.ppt

TF-IDF. David Kauchak cs160 Fall 2009 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture6-tfidf.ppt TF-IDF David Kauchak cs160 Fall 2009 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture6-tfidf.ppt Administrative Homework 3 available soon Assignment 2 available soon Popular media article

More information

Recommender Systems: Content-based, Knowledge-based, Hybrid. Radek Pelánek

Recommender Systems: Content-based, Knowledge-based, Hybrid. Radek Pelánek Recommender Systems: Content-based, Knowledge-based, Hybrid Radek Pelánek 2015 Today lecture, basic principles: content-based knowledge-based hybrid, choice of approach,... critiquing, explanations,...

More information

Can shared mobility offer an answer to the problems with regard to transport poverty?

Can shared mobility offer an answer to the problems with regard to transport poverty? Can shared mobility offer an answer to the problems with regard to transport poverty? Is sharing the new having? Shared mobility What are we talking about? Examples Motivation Effects? An annotation Transition

More information

Protein Protein Interaction Networks

Protein Protein Interaction Networks Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics

More information

Making Sense of the Mayhem: Machine Learning and March Madness

Making Sense of the Mayhem: Machine Learning and March Madness Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University [email protected] [email protected] I. Introduction III. Model The goal of our research

More information

Latent Semantic Indexing with Selective Query Expansion Abstract Introduction

Latent Semantic Indexing with Selective Query Expansion Abstract Introduction Latent Semantic Indexing with Selective Query Expansion Andy Garron April Kontostathis Department of Mathematics and Computer Science Ursinus College Collegeville PA 19426 Abstract This article describes

More information

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information

More information

Alignment and Preprocessing for Data Analysis

Alignment and Preprocessing for Data Analysis Alignment and Preprocessing for Data Analysis Preprocessing tools for chromatography Basics of alignment GC FID (D) data and issues PCA F Ratios GC MS (D) data and issues PCA F Ratios PARAFAC Piecewise

More information

Time Domain and Frequency Domain Techniques For Multi Shaker Time Waveform Replication

Time Domain and Frequency Domain Techniques For Multi Shaker Time Waveform Replication Time Domain and Frequency Domain Techniques For Multi Shaker Time Waveform Replication Thomas Reilly Data Physics Corporation 1741 Technology Drive, Suite 260 San Jose, CA 95110 (408) 216-8440 This paper

More information

Text Analytics. A business guide

Text Analytics. A business guide Text Analytics A business guide February 2014 Contents 3 The Business Value of Text Analytics 4 What is Text Analytics? 6 Text Analytics Methods 8 Unstructured Meets Structured Data 9 Business Application

More information

Unsupervised learning: Clustering

Unsupervised learning: Clustering Unsupervised learning: Clustering Salissou Moutari Centre for Statistical Science and Operational Research CenSSOR 17 th September 2013 Unsupervised learning: Clustering 1/52 Outline 1 Introduction What

More information

Collective Behavior Prediction in Social Media. Lei Tang Data Mining & Machine Learning Group Arizona State University

Collective Behavior Prediction in Social Media. Lei Tang Data Mining & Machine Learning Group Arizona State University Collective Behavior Prediction in Social Media Lei Tang Data Mining & Machine Learning Group Arizona State University Social Media Landscape Social Network Content Sharing Social Media Blogs Wiki Forum

More information

Linear Threshold Units

Linear Threshold Units Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

More information

Matrix Calculations: Applications of Eigenvalues and Eigenvectors; Inner Products

Matrix Calculations: Applications of Eigenvalues and Eigenvectors; Inner Products Matrix Calculations: Applications of Eigenvalues and Eigenvectors; Inner Products H. Geuvers Institute for Computing and Information Sciences Intelligent Systems Version: spring 2015 H. Geuvers Version:

More information

Interactive Dynamic Information Extraction

Interactive Dynamic Information Extraction Interactive Dynamic Information Extraction Kathrin Eichler, Holmer Hemsen, Markus Löckelt, Günter Neumann, and Norbert Reithinger Deutsches Forschungszentrum für Künstliche Intelligenz - DFKI, 66123 Saarbrücken

More information

15.062 Data Mining: Algorithms and Applications Matrix Math Review

15.062 Data Mining: Algorithms and Applications Matrix Math Review .6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop

More information

Domain Classification of Technical Terms Using the Web

Domain Classification of Technical Terms Using the Web Systems and Computers in Japan, Vol. 38, No. 14, 2007 Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J89-D, No. 11, November 2006, pp. 2470 2482 Domain Classification of Technical Terms Using

More information

A Large Scale Evaluation of Distributional Semantic Models: Parameters, Interactions and Model Selection

A Large Scale Evaluation of Distributional Semantic Models: Parameters, Interactions and Model Selection A Large Scale Evaluation of Distributional Semantic Models: Parameters, Interactions and Model Selection Gabriella Lapesa 2,1 1 Universität Osnabrück Institut für Kognitionswissenschaft Albrechtstr. 28,

More information

Learning is a very general term denoting the way in which agents:

Learning is a very general term denoting the way in which agents: What is learning? Learning is a very general term denoting the way in which agents: Acquire and organize knowledge (by building, modifying and organizing internal representations of some external reality);

More information

Data, Measurements, Features

Data, Measurements, Features Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are

More information

Assessment. Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall

Assessment. Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall Automatic Photo Quality Assessment Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall Estimating i the photorealism of images: Distinguishing i i paintings from photographs h Florin

More information

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland Data Mining and Knowledge Discovery in Databases (KDD) State of the Art Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland 1 Conference overview 1. Overview of KDD and data mining 2. Data

More information

Identifying Focus, Techniques and Domain of Scientific Papers

Identifying Focus, Techniques and Domain of Scientific Papers Identifying Focus, Techniques and Domain of Scientific Papers Sonal Gupta Department of Computer Science Stanford University Stanford, CA 94305 [email protected] Christopher D. Manning Department of

More information

A Toolbox for Bicluster Analysis in R

A Toolbox for Bicluster Analysis in R Sebastian Kaiser and Friedrich Leisch A Toolbox for Bicluster Analysis in R Technical Report Number 028, 2008 Department of Statistics University of Munich http://www.stat.uni-muenchen.de A Toolbox for

More information

CS4025: Pragmatics. Resolving referring Expressions Interpreting intention in dialogue Conversational Implicature

CS4025: Pragmatics. Resolving referring Expressions Interpreting intention in dialogue Conversational Implicature CS4025: Pragmatics Resolving referring Expressions Interpreting intention in dialogue Conversational Implicature For more info: J&M, chap 18,19 in 1 st ed; 21,24 in 2 nd Computing Science, University of

More information

Object Recognition. Selim Aksoy. Bilkent University [email protected]

Object Recognition. Selim Aksoy. Bilkent University saksoy@cs.bilkent.edu.tr Image Classification and Object Recognition Selim Aksoy Department of Computer Engineering Bilkent University [email protected] Image classification Image (scene) classification is a fundamental

More information

Customer Intentions Analysis of Twitter Based on Semantic Patterns

Customer Intentions Analysis of Twitter Based on Semantic Patterns Customer Intentions Analysis of Twitter Based on Semantic Patterns Mohamed Hamroun [email protected] Mohamed Salah Gouider [email protected] Lamjed Ben Said [email protected] ABSTRACT

More information

Mining Text Data: An Introduction

Mining Text Data: An Introduction Bölüm 10. Metin ve WEB Madenciliği http://ceng.gazi.edu.tr/~ozdemir Mining Text Data: An Introduction Data Mining / Knowledge Discovery Structured Data Multimedia Free Text Hypertext HomeLoan ( Frank Rizzo

More information

Artificial Intelligence and Transactional Law: Automated M&A Due Diligence. By Ben Klaber

Artificial Intelligence and Transactional Law: Automated M&A Due Diligence. By Ben Klaber Artificial Intelligence and Transactional Law: Automated M&A Due Diligence By Ben Klaber Introduction Largely due to the pervasiveness of electronically stored information (ESI) and search and retrieval

More information

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

More information

Parsing Software Requirements with an Ontology-based Semantic Role Labeler

Parsing Software Requirements with an Ontology-based Semantic Role Labeler Parsing Software Requirements with an Ontology-based Semantic Role Labeler Michael Roth University of Edinburgh [email protected] Ewan Klein University of Edinburgh [email protected] Abstract Software

More information

Term extraction for user profiling: evaluation by the user

Term extraction for user profiling: evaluation by the user Term extraction for user profiling: evaluation by the user Suzan Verberne 1, Maya Sappelli 1,2, Wessel Kraaij 1,2 1 Institute for Computing and Information Sciences, Radboud University Nijmegen 2 TNO,

More information

Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

More information

Automatically Constructing a Wordnet for Dutch

Automatically Constructing a Wordnet for Dutch Automatically Constructing a Wordnet for Dutch rceal, University of Cambridge clin 21 11 February 2011 Ghent Manual construction of a semantic hierarchy is a tedious and time-consuming job Automatic methods

More information

Statistical Feature Selection Techniques for Arabic Text Categorization

Statistical Feature Selection Techniques for Arabic Text Categorization Statistical Feature Selection Techniques for Arabic Text Categorization Rehab M. Duwairi Department of Computer Information Systems Jordan University of Science and Technology Irbid 22110 Jordan Tel. +962-2-7201000

More information

Machine Learning for Medical Image Analysis. A. Criminisi & the InnerEye team @ MSRC

Machine Learning for Medical Image Analysis. A. Criminisi & the InnerEye team @ MSRC Machine Learning for Medical Image Analysis A. Criminisi & the InnerEye team @ MSRC Medical image analysis the goal Automatic, semantic analysis and quantification of what observed in medical scans Brain

More information

Network Big Data: Facing and Tackling the Complexities Xiaolong Jin

Network Big Data: Facing and Tackling the Complexities Xiaolong Jin Network Big Data: Facing and Tackling the Complexities Xiaolong Jin CAS Key Laboratory of Network Data Science & Technology Institute of Computing Technology Chinese Academy of Sciences (CAS) 2015-08-10

More information

Data Mining on Social Networks. Dionysios Sotiropoulos Ph.D.

Data Mining on Social Networks. Dionysios Sotiropoulos Ph.D. Data Mining on Social Networks Dionysios Sotiropoulos Ph.D. 1 Contents What are Social Media? Mathematical Representation of Social Networks Fundamental Data Mining Concepts Data Mining Tasks on Digital

More information

Cognitive Abilities Test Practice Activities. Te ach e r G u i d e. Form 7. Verbal Tests. Level5/6. Cog

Cognitive Abilities Test Practice Activities. Te ach e r G u i d e. Form 7. Verbal Tests. Level5/6. Cog Cognitive Abilities Test Practice Activities Te ach e r G u i d e Form 7 Verbal Tests Level5/6 Cog Test 1: Picture Analogies, Levels 5/6 7 Part 1: Overview of Picture Analogies An analogy draws parallels

More information

W6.B.1. FAQs CS535 BIG DATA W6.B.3. 4. If the distance of the point is additionally less than the tight distance T 2, remove it from the original set

W6.B.1. FAQs CS535 BIG DATA W6.B.3. 4. If the distance of the point is additionally less than the tight distance T 2, remove it from the original set http://wwwcscolostateedu/~cs535 W6B W6B2 CS535 BIG DAA FAQs Please prepare for the last minute rush Store your output files safely Partial score will be given for the output from less than 50GB input Computer

More information

Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC

Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC 1. Introduction A popular rule of thumb suggests that

More information

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012 Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts

More information

Analytics on Big Data

Analytics on Big Data Analytics on Big Data Riccardo Torlone Università Roma Tre Credits: Mohamed Eltabakh (WPI) Analytics The discovery and communication of meaningful patterns in data (Wikipedia) It relies on data analysis

More information

Tackling Big Data with Tensor Methods

Tackling Big Data with Tensor Methods Tackling Big Data with Tensor Methods Anima Anandkumar U.C. Irvine Learning with Big Data Data vs. Information Data vs. Information Data vs. Information Missing observations, gross corruptions, outliers.

More information

CAPTURING THE VALUE OF UNSTRUCTURED DATA: INTRODUCTION TO TEXT MINING

CAPTURING THE VALUE OF UNSTRUCTURED DATA: INTRODUCTION TO TEXT MINING CAPTURING THE VALUE OF UNSTRUCTURED DATA: INTRODUCTION TO TEXT MINING Mary-Elizabeth ( M-E ) Eddlestone Principal Systems Engineer, Analytics SAS Customer Loyalty, SAS Institute, Inc. Is there valuable

More information

CLUSTER ANALYSIS FOR SEGMENTATION

CLUSTER ANALYSIS FOR SEGMENTATION CLUSTER ANALYSIS FOR SEGMENTATION Introduction We all understand that consumers are not all alike. This provides a challenge for the development and marketing of profitable products and services. Not every

More information

Crossing Corpora. Modelling Semantic Similarity across Languages and Lects.

Crossing Corpora. Modelling Semantic Similarity across Languages and Lects. Distributional Models Bilectal Bilingual Crossing Corpora. Modelling Semantic Similarity across Languages and Lects. Yves Peirsman Supervisors: Dirk Geeraerts & Dirk Speelman Quantitative Lexicology and

More information

ONLINE RESUME PARSING SYSTEM USING TEXT ANALYTICS

ONLINE RESUME PARSING SYSTEM USING TEXT ANALYTICS ONLINE RESUME PARSING SYSTEM USING TEXT ANALYTICS Divyanshu Chandola 1, Aditya Garg 2, Ankit Maurya 3, Amit Kushwaha 4 1 Student, Department of Information Technology, ABES Engineering College, Uttar Pradesh,

More information

Rank one SVD: un algorithm pour la visualisation d une matrice non négative

Rank one SVD: un algorithm pour la visualisation d une matrice non négative Rank one SVD: un algorithm pour la visualisation d une matrice non négative L. Labiod and M. Nadif LIPADE - Universite ParisDescartes, France ECAIS 2013 November 7, 2013 Outline Outline 1 Data visualization

More information

Chapter 8. Final Results on Dutch Senseval-2 Test Data

Chapter 8. Final Results on Dutch Senseval-2 Test Data Chapter 8 Final Results on Dutch Senseval-2 Test Data The general idea of testing is to assess how well a given model works and that can only be done properly on data that has not been seen before. Supervised

More information

TEACHING ADULTS TO REASON STATISTICALLY

TEACHING ADULTS TO REASON STATISTICALLY TEACHING ADULTS TO REASON STATISTICALLY USING THE LEARNING PROGRESSIONS T H E N AT U R E O F L E A R N I N G Mā te mōhio ka ora: mā te ora ka mōhio Through learning there is life: through life there is

More information

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin

More information

Statistics and Probability

Statistics and Probability Statistics and Probability TABLE OF CONTENTS 1 Posing Questions and Gathering Data. 2 2 Representing Data. 7 3 Interpreting and Evaluating Data 13 4 Exploring Probability..17 5 Games of Chance 20 6 Ideas

More information

Projektgruppe. Categorization of text documents via classification

Projektgruppe. Categorization of text documents via classification Projektgruppe Steffen Beringer Categorization of text documents via classification 4. Juni 2010 Content Motivation Text categorization Classification in the machine learning Document indexing Construction

More information

C o p yr i g ht 2015, S A S I nstitute Inc. A l l r i g hts r eser v ed. INTRODUCTION TO SAS TEXT MINER

C o p yr i g ht 2015, S A S I nstitute Inc. A l l r i g hts r eser v ed. INTRODUCTION TO SAS TEXT MINER INTRODUCTION TO SAS TEXT MINER TODAY S AGENDA INTRODUCTION TO SAS TEXT MINER Define data mining Overview of SAS Enterprise Miner Describe text analytics and define text data mining Text Mining Process

More information

Machine Learning using MapReduce

Machine Learning using MapReduce Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

More information

Map-Reduce for Machine Learning on Multicore

Map-Reduce for Machine Learning on Multicore Map-Reduce for Machine Learning on Multicore Chu, et al. Problem The world is going multicore New computers - dual core to 12+-core Shift to more concurrent programming paradigms and languages Erlang,

More information

CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #18: Dimensionality Reduc7on

CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #18: Dimensionality Reduc7on CS 5614: (Big) Data Management Systems B. Aditya Prakash Lecture #18: Dimensionality Reduc7on Dimensionality Reduc=on Assump=on: Data lies on or near a low d- dimensional subspace Axes of this subspace

More information