Unsupervised Prediction of Citation Influences. Laura Dietz, Steffen Bickel, Tobias Scheffer

1 Unsupervised Prediction of Citation Influences Laura Dietz, Steffen Bickel, Tobias Scheffer

2 Outline. Motivation: filter citation graphs; predict the strength of influence of links. Generative models: copycat model; citation influence model. Experiments: predictive performance based on labeled data. Conclusion & future work.

3 Read on a New Topic. Diagram steps: seed publications; infer citation strengths; add citation vicinity; filter edges; add cited publications; filter isolated nodes. Different reasons for citing: approaches extended by the citing paper; baselines or papers tackling the same problem; basic or background literature; related work to be argued against; because everybody cites this.
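The two filtering steps named on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the dictionary-based graph representation, the function name `filter_citation_graph`, the paper identifiers, and the threshold value are all assumptions, and the citation strengths are taken as already inferred.

```python
def filter_citation_graph(strengths, threshold):
    """Return (nodes, edges) after dropping weak citations and isolated nodes.

    strengths maps (citing, cited) pairs to an inferred influence strength.
    The threshold is a hypothetical cut-off chosen by the user.
    """
    # Filter edges: keep only citations whose strength reaches the threshold.
    edges = {edge: s for edge, s in strengths.items() if s >= threshold}
    # Filter isolated nodes: a node survives only if a kept edge touches it.
    nodes = {node for edge in edges for node in edge}
    return nodes, edges


# Made-up strengths for illustration only.
strengths = {
    ("paper_a", "paper_b"): 0.6,
    ("paper_a", "paper_c"): 0.1,
    ("paper_b", "paper_d"): 0.4,
}
nodes, edges = filter_citation_graph(strengths, threshold=0.3)
print(sorted(nodes))
print(sorted(edges))
```

Here `paper_c` disappears because its only edge falls below the threshold, which is the effect the slide calls filtering isolated nodes.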

4 Read on a New Topic: Filter & Layout

5 Predict Citation Influence. Problem: given a citation graph and the text of the documents, how strong is the influence of a cited publication? No strength labels are available, so the task is unsupervised. Idea: an LDA-like generative model that captures interactions between publications and associates words in the citing publication with citations. If many words are associated with one citation, that citation had a strong influence.
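The counting idea on this slide can be illustrated with a small sketch. It assumes that each word of a citing publication has already been assigned to one cited publication (for example by a sampler) and turns the assignment counts into normalized strengths. The function name `citation_strengths` and the additive smoothing constant are assumptions made for illustration, not values from the slides.

```python
from collections import Counter


def citation_strengths(assignments, cited_docs, smoothing=1.0):
    """Estimate a strength per cited doc from word-to-citation assignments.

    assignments lists, for every word of the citing doc, the cited doc it
    was associated with. smoothing is a hypothetical pseudo-count.
    """
    counts = Counter(assignments)
    total = len(assignments) + smoothing * len(cited_docs)
    return {c: (counts[c] + smoothing) / total for c in cited_docs}


# Six words of one citing publication, four of them associated with "c1".
assignments = ["c1", "c1", "c2", "c1", "c3", "c1"]
gamma = citation_strengths(assignments, ["c1", "c2", "c3"])
print(gamma)
```

The cited publication that attracts the most words receives the largest strength, and the strengths sum to one.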

6-7 Generative Process: Copycat. [Plate diagram. Cited publications are modeled with Latent Dirichlet Allocation. Labels: Dirichlet priors; topic mixture of cited publication; topic t for word in cited doc; word w in cited doc; word distribution for each topic; strength of citation influences; cited doc c associated with word; topic mix of associated cited doc c; topic t for word in citing doc; word w in citing doc. Plates: for each token, for each cited doc, for each topic, for each citing doc.] All topics are inherited from cites.

8 Properties of the Copycat Model. Tight coupling between citing and cited publications, and between bibliographically coupled publications. Issue: all words have to be associated with a citation. This introduces noise into the shared topic mixture and influences the association process of other citing publications. Solution: the model may decide not to associate some words.

9 Generative Process: Citation Influence. [Plate diagram. Labels: Dirichlet priors; topic mixture of cited publication; topic t for word in cited doc; word w in cited doc; word distribution for each topic; strength of citation influences; cited doc c associated with word; topic mix of associated cited doc c; topic t for word in citing doc; word w in citing doc. Plates: for each token, for each cited doc, for each topic, for each citing doc.] For each word of a citing document, flip an unfair coin. If inherit: draw the topic from the cites (as in copycat). If innovate: draw the topic from the document's own topic mix.
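The coin-flip process for a citing document can be sketched as a small sampler. This is a hedged illustration under stated assumptions, not the authors' code: the variable names (`gamma` for citation strengths, `psi` for the own topic mixture, `lam` for the coin bias), the symmetric Dirichlet hyperparameters, and the Beta prior on the coin are all assumptions, since the slide text does not preserve the symbols.

```python
import random


def dirichlet(alpha, size, rng):
    """Draw from a symmetric Dirichlet via normalized gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]


def generate_citing_doc(cited_topic_mixes, phi, n_tokens, rng):
    """Generate tokens (word, topic, cited_doc_or_None) for one citing doc.

    cited_topic_mixes: topic mixture of each cited doc.
    phi: word distribution for each topic.
    """
    n_topics = len(phi)
    gamma = dirichlet(1.0, len(cited_topic_mixes), rng)  # citation strengths
    psi = dirichlet(1.0, n_topics, rng)                  # own topic mixture
    lam = rng.betavariate(1.0, 1.0)                      # P(inherit), assumed Beta prior
    tokens = []
    for _ in range(n_tokens):
        if rng.random() < lam:
            # Inherit: pick a cited doc by strength, then a topic from its mixture.
            c = rng.choices(range(len(gamma)), weights=gamma)[0]
            t = rng.choices(range(n_topics), weights=cited_topic_mixes[c])[0]
        else:
            # Innovate: draw the topic from the document's own mixture.
            c = None
            t = rng.choices(range(n_topics), weights=psi)[0]
        w = rng.choices(range(len(phi[t])), weights=phi[t])[0]
        tokens.append((w, t, c))
    return gamma, tokens


rng = random.Random(0)
phi = [dirichlet(0.5, 6, rng) for _ in range(3)]    # 3 topics, 6-word vocabulary
cited = [dirichlet(0.5, 3, rng) for _ in range(2)]  # 2 cited docs
gamma, tokens = generate_citing_doc(cited, phi, n_tokens=20, rng=rng)
print(tokens[:5])
```

Setting the coin so that every word is inherited recovers the copycat behaviour, in which all topics come from the cited publications.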

10 Collapsed Gibbs Sampler. Joint distribution for the citation influence model: [equation not preserved]. A collapsed Gibbs sampler is derived for both models. Convergence is monitored with 5 chains [Brooks & Gelman 98].
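Multi-chain convergence monitoring can be illustrated with the basic potential scale reduction factor for one scalar quantity. The sketch below uses the plain between-chain and within-chain variance ratio; Brooks & Gelman describe refinements (such as a degrees-of-freedom correction) that are omitted here, and the function name and synthetic chains are assumptions for illustration.

```python
import math
import random
import statistics


def potential_scale_reduction(chains):
    """Basic PSRF for equally long chains of one scalar quantity."""
    n = len(chains[0])
    chain_means = [statistics.mean(chain) for chain in chains]
    # W: average within-chain sample variance.
    within = statistics.mean([statistics.variance(chain) for chain in chains])
    # B: n times the sample variance of the chain means.
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)


rng = random.Random(1)
# Five synthetic chains drawn from the same distribution (well mixed).
mixed = [[rng.gauss(0.0, 1.0) for _ in range(200)] for _ in range(5)]
# Five synthetic chains stuck around different values (not mixed).
stuck = [[rng.gauss(offset, 1.0) for _ in range(200)] for offset in range(5)]
print(potential_scale_reduction(mixed))
print(potential_scale_reduction(stuck))
```

Values close to 1 suggest that the chains agree; values well above 1 indicate that the chains have not yet converged to the same distribution.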

11 Experiments: Predictive Performance. 26 publications from the CiteSeer 04 data set, labeled by experts on a Likert scale: ++, +, -, --. Corpus = labeled publications + 1 level of citations (total: 179 docs).

12 Experiments: Baseline Approaches. LDA-JS: similarity of topic mixtures by Jensen-Shannon divergence, $\gamma(c \mid d) \propto \exp(-D_{JS}(\theta_d \,\|\, \theta_c))$. LDA-post: use Bayes' rule / chain rule to convert $p(t \mid c)$ to $p(c \mid t)$, $\gamma(c \mid d) \propto \sum_t p(t \mid d)\, p(c \mid t)$. TF-IDF: cosine TF-IDF similarity, $\gamma(c \mid d) \propto \cos(\text{TF-IDF}(d), \text{TF-IDF}(c))$. PageRank: $\gamma(c \mid d) \propto \text{PageRank}(c)$ (graph with 3 citation levels).
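The two LDA-based baselines can be sketched from topic mixtures alone. This is a minimal illustration with made-up mixtures; the function names are hypothetical, and the uniform prior over cited documents used when inverting $p(t \mid c)$ is an assumption that the slide does not state.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence, skipping zero-probability terms of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def lda_js_strength(theta_d, theta_c):
    """LDA-JS baseline: exp(-D_JS(theta_d || theta_c))."""
    return math.exp(-js_divergence(theta_d, theta_c))


def lda_post_strengths(theta_d, thetas_cited):
    """LDA-post baseline: sum_t p(t|d) p(c|t) for every cited doc c.

    p(c|t) is obtained from p(t|c) by Bayes' rule with a uniform prior
    over the cited docs (an assumption of this sketch).
    """
    n_topics = len(theta_d)
    norm = [sum(theta_c[t] for theta_c in thetas_cited) for t in range(n_topics)]
    return [
        sum(theta_d[t] * theta_c[t] / norm[t] for t in range(n_topics) if norm[t] > 0)
        for theta_c in thetas_cited
    ]


theta_d = [0.7, 0.2, 0.1]                         # citing doc (made up)
thetas_cited = [[0.6, 0.3, 0.1], [0.1, 0.1, 0.8]]  # two cited docs (made up)
js_scores = [lda_js_strength(theta_d, theta_c) for theta_c in thetas_cited]
post_scores = lda_post_strengths(theta_d, thetas_cited)
print(js_scores)
print(post_scores)
```

Both baselines rank the first cited document higher in this toy example, because its topic mixture is closer to that of the citing document.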

13 Experiments: Evaluation Measure. For each single publication d: the citation strength $\gamma(\cdot \mid d)$ is a decision function. Area under the ROC curve (AUC) for three splits: ++ vs. +, -, --; ++, + vs. -, --; ++, +, - vs. --. Average the AUC values for each d. Corpus-wide evaluation measure: mean and standard error of the averaged AUCs; paired t-tests.
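The per-publication measure can be sketched as follows: compute the AUC of the predicted strengths for each of the three label splits and average the results. The function names, the numeric encoding of the Likert labels, and the choice to skip a split when one side is empty are assumptions of this sketch; the scores are made up.

```python
LIKERT = {"++": 3, "+": 2, "-": 1, "--": 0}


def auc(scores, positives):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties count as half. Returns None when one class is empty.
    """
    pos = [s for s, y in zip(scores, positives) if y]
    neg = [s for s, y in zip(scores, positives) if not y]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def averaged_auc(scores, labels):
    """Average AUC over the splits ++ | rest, ++,+ | -,-- and ++,+,- | --."""
    levels = [LIKERT[label] for label in labels]
    values = []
    for cut in (3, 2, 1):
        value = auc(scores, [level >= cut for level in levels])
        if value is not None:  # skip undefined splits (assumption)
            values.append(value)
    return sum(values) / len(values) if values else None


# Predicted strengths for four citations of one publication (made up).
labels = ["++", "+", "-", "--"]
perfect = averaged_auc([0.5, 0.3, 0.15, 0.05], labels)
reversed_order = averaged_auc([0.05, 0.15, 0.3, 0.5], labels)
print(perfect, reversed_order)
```

A ranking that agrees with the expert labels on every split yields an averaged AUC of 1, a fully reversed ranking yields 0, and a random ranking is expected to score about 0.5.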

14-17 Experiments: Predictive Performance. [Figure: AUC versus number of topics for Citation Influence, Copycat, LDA-JS, and LDA-post.] Citation influence is the best method on average. It is significantly better than LDA (paired t-test, α = 5%) and insensitive to the number of topics. TF-IDF and PageRank do not work (AUC ≈ 0.5, i.e. chance level).

18 Experiments: Convergence. Citation influence converges faster than copycat. [Figure: potential scale reduction factor versus iterations; convergence of CI (blue) vs. CC (orange).] Runtime: citation influence 46 min, copycat 3 h.

19 Narrative Evaluation: Visualization

20 Narrative Evaluation: Analyze Abstract. Association of words in the abstract to cites, $p(w, c \mid d)$. Example: the research paper "Latent Dirichlet Allocation".

21 Conclusion & Future Work. Two generative models for link strength prediction. Prediction: citation influence is better than LDA and stable with respect to the number of topics. Runtime: citation influence is faster than copycat. Useful for filtering citation graphs. Future work: influence strengths in social networks and the web; better authority ranking with link strengths.
