Unsupervised Prediction of. Laura Dietz, Steffen Bickel, Tobias Scheffer

Size: px
Start display at page:

Download "Unsupervised Prediction of. Laura Dietz, Steffen Bickel, Tobias Scheffer"

Transcription

1 Unsupervised Prediction of Citation Influences Laura Dietz, Steffen Bickel, Tobias Scheffer

2 Outline Motivation. Filter citation graphs. Predict strength of influence of links. Generative models. Copycat model. Citation influence model. Experiments. Predictive performance based on labeled data. Conclusion & future work. Laura Diet tz, Steffen Bickel, To obias Sch effer 2

3 Read on a New Topic Seed publications. Infer citation strengths. Add citation vicinity. Filter edges. Add cited publications. Filter isolated nodes. Different reasons for citing: Approaches extended by citing paper. Baselines / papers tackling the same problem. Basic / background literature. Related work to be argued against. Because everybody cites this. 3 Laura Diet tz, Steffen Bickel, To obias Sch effer

4 Read on a New Topic: Filter & Layout Laura Dietz, Steffen Bickel, Tobias Scheffer 4

5 Predict Citation Influence Problem: Given citation graph and text for documents. How strong is the influence of a cited publication? No strength labels unsupervised. Idea: LDA like generative model. Capture interactions between publications. Associate words in the publication to citations. If many words are associated to one citation. This citation had a strong influence. Laura Diet tz, Steffen Bickel, To obias Sch effer 5

6 Generative Process: Copycat Latent Dirichlet Allocation L1 dirichlet prior dirichlet prior Laura Diet 1 1 _ _ 1 2 _ 2 3 _ 3 3 _ 3 3 _ 3 3 citing doc topic mixture of cited pub topic mix of associtated cited doc c strength of citation influences cited doc associated with word topic for word in t dirichlet prior topic for word in t cited doc citing doc c tz, Steffen Bickel, To obias Sch cited doc word in cited doc w word distr for each topic word in citing doc w effer All topics are inherited from cites. for each token... for each cited doc... for each topic... for each token... for each citing doc... 6

7 Generative Process: Copycat L1 dirichlet prior dirichlet prior Laura Diet 1 1 _ _ 1 2 _ 2 3 _ 3 3 _ 3 3 _ _ 3 _ 3 topic mixture of cited pub topic mix of associtated cited doc c strength of citation influences cited doc associated with word topic for word in t dirichlet prior topic for word in t cited doc citing doc c tz, Steffen Bickel, To obias Sch 3 3 word in cited doc w word distr for each topic word in citing doc w effer for each token... for each cited doc... for each topic... for each token... for each citing doc... 7

8 Properties of the Copycat Model Tight coupling between: Citing and cited publications. Bibliographically coupled publications. Issue: all words have to be associated to a citation. Introduce noise in the shared topic mixture. Influence association process of other citing publications. Solution: Model may decide to not associate some words. Laura Diet tz, Steffen Bickel, To obias Sch effer 8

9 Generative Process: Citation Influence L1 dirichlet prior dirichlet prior Laura Diet 1 1 _ _ _ 3 3 topic mixture of cited pub topic for word in cited doc t topic mix of associtated cited doc c dirichlet prior strength of citation influences cited doc associated with word c topic for word in citing doc t tz, Steffen Bickel, To obias Sch Flip unfair coin for word. If inherit: draw topic from cites (as in copycat). word in cited doc w for each token... for each cited doc... If innovate: draw from own topic mix. word distr for each topic for each topic... word in citing doc w for each token... for each citing doc... 9 effer

10 Collapsed Gibbs Sampler Joint distribution for citation influence model: Derive collapsed Gibbs sampler for both models. Convergence monitoring (5 chains). [Brooks & Gelman 98] Laura Diet tz, Steffen Bickel, Tobias Sch effer 10

11 Experiments: Predictive Performance 26 publications from CiteSeer 04 data set. Labeled by experts on Likert scale: ++, +, -, --. Corpus = labeled publications + 1 level of citations (total: 179 docs). Laura Diet tz, Steffen Bickel, Tobias Scheffer 11

12 Experiments: Baseline Approaches LDA-JS: Similarity of topic mixtures by Jensen Shannon div. γ(c d) ) exp( -DJS(( θd θc )). LDA-post: Use Bayes` rule / chain rule to convert p(t c) to p(c t). γ(c d) Σt p(t d) p(c t). TF-IDF: Cosine TF-IDF similarity. γ(c d) cos ( TF-IDF(d),TF-IDF(c) ). PageRank: γ(c d) PageRank(c). (Graph with 3 citation levels). Laura Diet tz, Steffen Bickel, To obias Sch effer 12

13 Experiments: Evaluation Measure For each single publication d: Citation strength γd is a decision function. Area under the ROC curve (AUC): ++ vs. +, -, , + vs. -, , +, - vs. --. Average AUC values for each d. Corpus-wide evaluation measure: Mean and std. error for averaged AUCs. Paired-t-tests. Laura Diet tz, Steffen Bickel, To obias Sch effer 13

14 Experiments: Predictive Performance AUC Citation Influence Copycat LDA-JS LDA-post Number of Topics Citation influence best method on average. Laura Diet tz, Steffen Bickel, To obias Sch effer 14

15 Experiments: Predictive Performance AUC Citation Influence Copycat LDA-JS LDA-post Number of Topics Citation influence best method on average. Significantly better than LDA (paired-t-test, α=5%). Laura Diet tz, Steffen Bickel, To obias Sch effer 15

16 Experiments: Predictive Performance AUC Citation Influence Copycat LDA-JS LDA-post Number of Topics Citation influence best method on average. Significantly better than LDA (paired-t-test, α=5%). Unsensitive to number of topics. Laura Diet tz, Steffen Bickel, To obias Sch effer 16

17 Experiments: Predictive Performance AUC Citation Influence Copycat LDA-JS LDA-post Number of Topics Citation influence best method on average. Significantly better than LDA (paired-t-test, α=5%). Unsensitive to number of topics. TF-IDF and PageRank do not work (AUC 0.5). 05) Laura Diet tz, Steffen Bickel, To obias Sch effer 17

18 Experiments: Convergence Citation influence converges faster than copycat. Po otential Scale Reduction Facto or Convergence of CI (blue) vs. CC (orange) Iterations Runtime: Citation influence 46 min, copycat 3h. Laura Diet tz, Steffen Bickel, To obias Sch effer 18

19 Laura Dietz, Steffen Bickel, Tobias Scheffer 19 Narrative Evaluation: Visualization

20 Narrative Evaluation: Analyze Abstract Association of words in abstract to cites (p(w,c d)). Example: research paper Latent Dirichlet Allocation. Laura Diet tz, Steffen Bickel, Tobias Scheffer 20

21 Conclusion & Future Work Two generative models for link strength prediction. Prediction: Citation influence better than LDA. Citation influence stable according to number of topics. Runtime: Citation influence faster than copycat. Useful for filtering citation graphs. Future work: Influence strengths in social networks and web. Better authority ranking with link strengths. Laura Diet tz, Steffen Bickel, To obias Sch effer 21

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University caizhua@gmail.com

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University caizhua@gmail.com Bayesian Machine Learning (ML): Modeling And Inference in Big Data Zhuhua Cai Google Rice University caizhua@gmail.com 1 Syllabus Bayesian ML Concepts (Today) Bayesian ML on MapReduce (Next morning) Bayesian

More information

The Hadoop Implementation. Thomas Zimmermann Philipp Berger

The Hadoop Implementation. Thomas Zimmermann Philipp Berger Link Analysis goes MapReduce The Hadoop Implementation Thomas Zimmermann Philipp Berger Flashback 2 Overview 3 1. Pre- / Postprocessing 2. Our Jobs 3. Evaluation Overview 4 1. Pre- / Postprocessing 2.

More information

Topic models for Sentiment analysis: A Literature Survey

Topic models for Sentiment analysis: A Literature Survey Topic models for Sentiment analysis: A Literature Survey Nikhilkumar Jadhav 123050033 June 26, 2014 In this report, we present the work done so far in the field of sentiment analysis using topic models.

More information

Feature Selection with Monte-Carlo Tree Search

Feature Selection with Monte-Carlo Tree Search Feature Selection with Monte-Carlo Tree Search Robert Pinsler 20.01.2015 20.01.2015 Fachbereich Informatik DKE: Seminar zu maschinellem Lernen Robert Pinsler 1 Agenda 1 Feature Selection 2 Feature Selection

More information

Extracting Information from Social Networks

Extracting Information from Social Networks Extracting Information from Social Networks Aggregating site information to get trends 1 Not limited to social networks Examples Google search logs: flu outbreaks We Feel Fine Bullying 2 Bullying Xu, Jun,

More information

Chapter ML:XI (continued)

Chapter ML:XI (continued) Chapter ML:XI (continued) XI. Cluster Analysis Data Mining Overview Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained

More information

Course: Model, Learning, and Inference: Lecture 5

Course: Model, Learning, and Inference: Lecture 5 Course: Model, Learning, and Inference: Lecture 5 Alan Yuille Department of Statistics, UCLA Los Angeles, CA 90095 yuille@stat.ucla.edu Abstract Probability distributions on structured representation.

More information

Understanding LDA in Source Code Analysis

Understanding LDA in Source Code Analysis Understanding LDA in Source Code Analysis David Binkley Daniel Heinz Dawn Lawrie Justin Overfelt Computer Science Department, Loyola University Maryland, Baltimore, MD, USA Department of Mathematics and

More information

Document Classification with Latent Dirichlet Allocation

Document Classification with Latent Dirichlet Allocation Document Classification with Latent Dirichlet Allocation Ph.D. Thesis Summary István Bíró Supervisor: András Lukács Ph.D. Eötvös Loránd University Faculty of Informatics Department of Information Sciences

More information

SMART Electronic Legal Discovery via Topic Modeling

SMART Electronic Legal Discovery via Topic Modeling Proceedings of the Twenty-Seventh International Florida Artificial Intelligence Research Society Conference SMART Electronic Legal Discovery via Topic Modeling Clint P. George, Sahil Puri, Daisy Zhe Wang,

More information

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not. Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

More information

Probabilistic topic models for sentiment analysis on the Web

Probabilistic topic models for sentiment analysis on the Web University of Exeter Department of Computer Science Probabilistic topic models for sentiment analysis on the Web Chenghua Lin September 2011 Submitted by Chenghua Lin, to the the University of Exeter as

More information

Mining Topics in Documents Standing on the Shoulders of Big Data. Zhiyuan (Brett) Chen and Bing Liu

Mining Topics in Documents Standing on the Shoulders of Big Data. Zhiyuan (Brett) Chen and Bing Liu Mining Topics in Documents Standing on the Shoulders of Big Data Zhiyuan (Brett) Chen and Bing Liu Topic Models Widely used in many applications Most of them are unsupervised However, topic models Require

More information

MHI3000 Big Data Analytics for Health Care Final Project Report

MHI3000 Big Data Analytics for Health Care Final Project Report MHI3000 Big Data Analytics for Health Care Final Project Report Zhongtian Fred Qiu (1002274530) http://gallery.azureml.net/details/81ddb2ab137046d4925584b5095ec7aa 1. Data pre-processing The data given

More information

Network Big Data: Facing and Tackling the Complexities Xiaolong Jin

Network Big Data: Facing and Tackling the Complexities Xiaolong Jin Network Big Data: Facing and Tackling the Complexities Xiaolong Jin CAS Key Laboratory of Network Data Science & Technology Institute of Computing Technology Chinese Academy of Sciences (CAS) 2015-08-10

More information

Which universities lead and lag? Toward university rankings based on scholarly output

Which universities lead and lag? Toward university rankings based on scholarly output Which universities lead and lag? Toward university rankings based on scholarly output Daniel Ramage and Christopher D. Manning Computer Science Department Stanford University Stanford, California 94305

More information

Identifying Focus, Techniques and Domain of Scientific Papers

Identifying Focus, Techniques and Domain of Scientific Papers Identifying Focus, Techniques and Domain of Scientific Papers Sonal Gupta Department of Computer Science Stanford University Stanford, CA 94305 sonal@cs.stanford.edu Christopher D. Manning Department of

More information

Semi-Supervised Support Vector Machines and Application to Spam Filtering

Semi-Supervised Support Vector Machines and Application to Spam Filtering Semi-Supervised Support Vector Machines and Application to Spam Filtering Alexander Zien Empirical Inference Department, Bernhard Schölkopf Max Planck Institute for Biological Cybernetics ECML 2006 Discovery

More information

Introduction to Markov Chain Monte Carlo

Introduction to Markov Chain Monte Carlo Introduction to Markov Chain Monte Carlo Monte Carlo: sample from a distribution to estimate the distribution to compute max, mean Markov Chain Monte Carlo: sampling using local information Generic problem

More information

MICROSOFT ACCESS 2010 COURSES. Course Title: Access 2010 Level 1

MICROSOFT ACCESS 2010 COURSES. Course Title: Access 2010 Level 1 SORTED IT Training Courses Course materials and tutors provided by BUCS, bookings taken through Sorted, courses last 1.5hours and run on a rolling basis each week on Wednesday afternoons and Thursday evenings.

More information

Part 1: Link Analysis & Page Rank

Part 1: Link Analysis & Page Rank Chapter 8: Graph Data Part 1: Link Analysis & Page Rank Based on Leskovec, Rajaraman, Ullman 214: Mining of Massive Datasets 1 Exam on the 5th of February, 216, 14. to 16. If you wish to attend, please

More information

Using Data Fusion and Web Mining to Support Feature Location in Software

Using Data Fusion and Web Mining to Support Feature Location in Software Using Data Fusion and Web Mining to Support Feature Location in Software Meghan Revelle, Bogdan Dit, and Denys Poshyvanyk Department of Computer Science The College of William and Mary Williamsburg, Virginia,

More information

HES-SO Master of Science in Engineering. Clustering. Prof. Laura Elena Raileanu HES-SO Yverdon-les-Bains (HEIG-VD)

HES-SO Master of Science in Engineering. Clustering. Prof. Laura Elena Raileanu HES-SO Yverdon-les-Bains (HEIG-VD) HES-SO Master of Science in Engineering Clustering Prof. Laura Elena Raileanu HES-SO Yverdon-les-Bains (HEIG-VD) Plan Motivation Hierarchical Clustering K-Means Clustering 2 Problem Setup Arrange items

More information

The Pennsylvania State University The Graduate School MINING SOCIAL DOCUMENTS AND NETWORKS. A Thesis in Computer Science and Engineering by Ding Zhou

The Pennsylvania State University The Graduate School MINING SOCIAL DOCUMENTS AND NETWORKS. A Thesis in Computer Science and Engineering by Ding Zhou The Pennsylvania State University The Graduate School MINING SOCIAL DOCUMENTS AND NETWORKS A Thesis in Computer Science and Engineering by Ding Zhou c 2008 Ding Zhou Submitted in Partial Fulfillment of

More information

Exploratory Facebook Social Network Analysis

Exploratory Facebook Social Network Analysis Exploratory Facebook Social Network Analysis with Gephi Thomas Plotkowiak twitterresearcher.wordpress.com Process 1. Import Data with Netviz 2. Gephi 1. Open 2. Layout 3. Ranking (Degree) 4. Statistics

More information

Photolink- Fiber Optic Receiver PLR135/T1

Photolink- Fiber Optic Receiver PLR135/T1 Features High PD sensitivity optimized for red light Data : NRZ signal Low power consumption for extended battery life Built-in threshold control for improved noise Margin The product itself will remain

More information

Bayesian Statistics: Indian Buffet Process

Bayesian Statistics: Indian Buffet Process Bayesian Statistics: Indian Buffet Process Ilker Yildirim Department of Brain and Cognitive Sciences University of Rochester Rochester, NY 14627 August 2012 Reference: Most of the material in this note

More information

MACHINE LEARNING IN HIGH ENERGY PHYSICS

MACHINE LEARNING IN HIGH ENERGY PHYSICS MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!

More information

1 o Semestre 2007/2008

1 o Semestre 2007/2008 Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008 Outline 1 2 3 4 5 Outline 1 2 3 4 5 Exploiting Text How is text exploited? Two main directions Extraction Extraction

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

Gibbs Sampling and Online Learning Introduction

Gibbs Sampling and Online Learning Introduction Statistical Techniques in Robotics (16-831, F14) Lecture#10(Tuesday, September 30) Gibbs Sampling and Online Learning Introduction Lecturer: Drew Bagnell Scribes: {Shichao Yang} 1 1 Sampling Samples are

More information

RA MODEL VISUALIZATION WITH MICROSOFT EXCEL 2013 AND GEPHI

RA MODEL VISUALIZATION WITH MICROSOFT EXCEL 2013 AND GEPHI RA MODEL VISUALIZATION WITH MICROSOFT EXCEL 2013 AND GEPHI Prepared for Prof. Martin Zwick December 9, 2014 by Teresa D. Schmidt (tds@pdx.edu) 1. DOWNLOADING AND INSTALLING USER DEFINED SPLIT FUNCTION

More information

SWIFT: A Text-mining Workbench for Systematic Review

SWIFT: A Text-mining Workbench for Systematic Review SWIFT: A Text-mining Workbench for Systematic Review Ruchir Shah, PhD Sciome LLC NTP Board of Scientific Counselors Meeting June 16, 2015 Large Literature Corpus: An Ever Increasing Challenge Systematic

More information

Data Mining Yelp Data - Predicting rating stars from review text

Data Mining Yelp Data - Predicting rating stars from review text Data Mining Yelp Data - Predicting rating stars from review text Rakesh Chada Stony Brook University rchada@cs.stonybrook.edu Chetan Naik Stony Brook University cnaik@cs.stonybrook.edu ABSTRACT The majority

More information

Across-Model Collective Ensemble Classification

Across-Model Collective Ensemble Classification Across-Model Collective Ensemble Classification Hoda Eldardiry and Jennifer Neville Computer Science Department Purdue University West Lafayette, IN 47907 (hdardiry neville)@cs.purdue.edu Abstract Ensemble

More information

A Bayesian Approach to Discovering Truth from Conflicting Sources for Data Integration

A Bayesian Approach to Discovering Truth from Conflicting Sources for Data Integration A Bayesian Approach to Discovering Truth from Conflicting Sources for Data Integration Bo Zhao 1 Benjamin I. P. Rubinstein 2 Jim Gemmell 2 Jiawei Han 1 1 Department of Computer Science, University of Illinois,

More information

CS224W Project Report: Finding Top UI/UX Design Talent on Adobe Behance

CS224W Project Report: Finding Top UI/UX Design Talent on Adobe Behance CS224W Project Report: Finding Top UI/UX Design Talent on Adobe Behance Susanne Halstead, Daniel Serrano, Scott Proctor 6 December 2014 1 Abstract The Behance social network allows professionals of diverse

More information

Gephi Tutorial Quick Start

Gephi Tutorial Quick Start Gephi Tutorial Welcome to this introduction tutorial. It will guide you to the basic steps of network visualization and manipulation in Gephi. Gephi version 0.7alpha2 was used to do this tutorial. Get

More information

Felix Aronsson. Master s Thesis. Department of Electrical and Information Technology, Faculty of Engineering, LTH, Lund University, February 2015.

Felix Aronsson. Master s Thesis. Department of Electrical and Information Technology, Faculty of Engineering, LTH, Lund University, February 2015. Master s Thesis Large scale cluster analysis with Hadoop and Mahout Felix Aronsson Department of Electrical and Information Technology, Faculty of Engineering, LTH, Lund University, February 2015. Large

More information

Predicting the Stock Market with News Articles

Predicting the Stock Market with News Articles Predicting the Stock Market with News Articles Kari Lee and Ryan Timmons CS224N Final Project Introduction Stock market prediction is an area of extreme importance to an entire industry. Stock price is

More information

Clustering Technique in Data Mining for Text Documents

Clustering Technique in Data Mining for Text Documents Clustering Technique in Data Mining for Text Documents Ms.J.Sathya Priya Assistant Professor Dept Of Information Technology. Velammal Engineering College. Chennai. Ms.S.Priyadharshini Assistant Professor

More information

CSV886: Social, Economics and Business Networks. Lecture 2: Affiliation and Balance. R Ravi ravi+iitd@andrew.cmu.edu

CSV886: Social, Economics and Business Networks. Lecture 2: Affiliation and Balance. R Ravi ravi+iitd@andrew.cmu.edu CSV886: Social, Economics and Business Networks Lecture 2: Affiliation and Balance R Ravi ravi+iitd@andrew.cmu.edu Granovetter s Puzzle Resolved Strong Triadic Closure holds in most nodes in social networks

More information

Jure Leskovec (@jure) Stanford University

Jure Leskovec (@jure) Stanford University Jure Leskovec (@jure) Stanford University KDD Summer School, Beijing, August 2012 8/10/2012 Jure Leskovec (@jure), KDD Summer School 2012 2 Graph: Kronecker graphs Graph Node attributes: MAG model Graph

More information

Supervised Learning Evaluation (via Sentiment Analysis)!

Supervised Learning Evaluation (via Sentiment Analysis)! Supervised Learning Evaluation (via Sentiment Analysis)! Why Analyze Sentiment? Sentiment Analysis (Opinion Mining) Automatically label documents with their sentiment Toward a topic Aggregated over documents

More information

Characterization of Latent Social Networks Discovered through Computer Network Logs

Characterization of Latent Social Networks Discovered through Computer Network Logs Characterization of Latent Social Networks Discovered through Computer Network Logs Kevin M. Carter MIT Lincoln Laboratory 244 Wood St Lexington, MA 02420 kevin.carter@ll.mit.edu Rajmonda S. Caceres MIT

More information

Topical Authority Identification in Community Question Answering

Topical Authority Identification in Community Question Answering Topical Authority Identification in Community Question Answering Guangyou Zhou, Kang Liu, and Jun Zhao National Laboratory of Pattern Recognition Institute of Automation, Chinese Academy of Sciences 95

More information

Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model

Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model AI TERM PROJECT GROUP 14 1 Anti-Spam Filter Based on,, and model Yun-Nung Chen, Che-An Lu, Chao-Yu Huang Abstract spam email filters are a well-known and powerful type of filters. We construct different

More information

The NIH Visual Browser: An Interactive Visualization of Biomedical Research

The NIH Visual Browser: An Interactive Visualization of Biomedical Research The NIH Visual Browser: An Interactive Visualization of Biomedical Research Bruce W. Herr II, Edmund M. Talley, Gully A.P.C. Burns, David Newman, Gavin LaRowe ChalkLabs, National Institute of Neurological

More information

8. Machine Learning Applied Artificial Intelligence

8. Machine Learning Applied Artificial Intelligence 8. Machine Learning Applied Artificial Intelligence Prof. Dr. Bernhard Humm Faculty of Computer Science Hochschule Darmstadt University of Applied Sciences 1 Retrospective Natural Language Processing Name

More information

Gaussian Processes to Speed up Hamiltonian Monte Carlo

Gaussian Processes to Speed up Hamiltonian Monte Carlo Gaussian Processes to Speed up Hamiltonian Monte Carlo Matthieu Lê Murray, Iain http://videolectures.net/mlss09uk_murray_mcmc/ Rasmussen, Carl Edward. "Gaussian processes to speed up hybrid Monte Carlo

More information

Machine Learning for Display Advertising

Machine Learning for Display Advertising Machine Learning for Display Advertising The Idea: Social Targeting for Online Advertising Performance Current ad spending seems disproportionate Online Advertising Spending Breakdown (2009) Source: IAB

More information

Graphs over Time Densification Laws, Shrinking Diameters and Possible Explanations

Graphs over Time Densification Laws, Shrinking Diameters and Possible Explanations Graphs over Time Densification Laws, Shrinking Diameters and Possible Explanations Jurij Leskovec, CMU Jon Kleinberg, Cornell Christos Faloutsos, CMU 1 Introduction What can we do with graphs? What patterns

More information

Leveraging Careful Microblog Users for Spammer Detection

Leveraging Careful Microblog Users for Spammer Detection Leveraging Careful Microblog Users for Spammer Detection Hao Fu University of Science and Technology of China fuch@mail.ustc.edu.cn Xing Xie Microsoft Research xingx@microsoft.com Yong Rui Microsoft Research

More information

The PageRank Citation Ranking: Bring Order to the Web

The PageRank Citation Ranking: Bring Order to the Web The PageRank Citation Ranking: Bring Order to the Web presented by: Xiaoxi Pang 25.Nov 2010 1 / 20 Outline Introduction A ranking for every page on the Web Implementation Convergence Properties Personalized

More information

Bayesian networks - Time-series models - Apache Spark & Scala

Bayesian networks - Time-series models - Apache Spark & Scala Bayesian networks - Time-series models - Apache Spark & Scala Dr John Sandiford, CTO Bayes Server Data Science London Meetup - November 2014 1 Contents Introduction Bayesian networks Latent variables Anomaly

More information

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE

More information

Mining Topics in Documents: Standing on the Shoulders of Big Data

Mining Topics in Documents: Standing on the Shoulders of Big Data Mining Topics in Documents: Standing on the Shoulders of Big Data Zhiyuan Chen and Bing Liu Department of Computer Science University of Illinois at Chicago czyuanacm@gmail.com, liub@cs.uic.edu ABSTRACT

More information

Better credit models benefit us all

Better credit models benefit us all Better credit models benefit us all Agenda Credit Scoring - Overview Random Forest - Overview Random Forest outperform logistic regression for credit scoring out of the box Interaction term hypothesis

More information

CS 207 - Data Science and Visualization Spring 2016

CS 207 - Data Science and Visualization Spring 2016 CS 207 - Data Science and Visualization Spring 2016 Professor: Sorelle Friedler sorelle@cs.haverford.edu An introduction to techniques for the automated and human-assisted analysis of data sets. These

More information

Sterile E Online NetworkTutorials

Sterile E Online NetworkTutorials 3M Sterilization Assurance Products EO Blue bar must enter area Sterile E Online NetworkTutorials EO Blue bar must enter area Understanding the Differences Between Class 5 s and Class 6 Emulating Indicators

More information

Machine Learning for Data Science (CS4786) Lecture 1

Machine Learning for Data Science (CS4786) Lecture 1 Machine Learning for Data Science (CS4786) Lecture 1 Tu-Th 10:10 to 11:25 AM Hollister B14 Instructors : Lillian Lee and Karthik Sridharan ROUGH DETAILS ABOUT THE COURSE Diagnostic assignment 0 is out:

More information

Multiple Location Profiling for Users and Relationships from Social Network and Content

Multiple Location Profiling for Users and Relationships from Social Network and Content Multiple Location Profiling for Users and Relationships from Social Network and Content Rui Li ruili1@illinois.edu Shengjie Wang wang260@illinois.edu Kevin Chen-Chuan Chang, kcchang@illinois.edu Department

More information

Bayes and Naïve Bayes. cs534-machine Learning

Bayes and Naïve Bayes. cs534-machine Learning Bayes and aïve Bayes cs534-machine Learning Bayes Classifier Generative model learns Prediction is made by and where This is often referred to as the Bayes Classifier, because of the use of the Bayes rule

More information

1 Maximum likelihood estimation

1 Maximum likelihood estimation COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N

More information

A neo4j powered social networking and Question & Answer application to enhance scientific communication. René Pickhardt, Heinrich Hartmann

A neo4j powered social networking and Question & Answer application to enhance scientific communication. René Pickhardt, Heinrich Hartmann A neo4j powered social networking and Question & Answer application to enhance scientific communication. René Pickhardt, Heinrich Hartmann related-work.net Roadmap Introduction Data structures for Q &

More information

Towards running complex models on big data

Towards running complex models on big data Towards running complex models on big data Working with all the genomes in the world without changing the model (too much) Daniel Lawson Heilbronn Institute, University of Bristol 2013 1 / 17 Motivation

More information

Learning Gaussian process models from big data. Alan Qi Purdue University Joint work with Z. Xu, F. Yan, B. Dai, and Y. Zhu

Learning Gaussian process models from big data. Alan Qi Purdue University Joint work with Z. Xu, F. Yan, B. Dai, and Y. Zhu Learning Gaussian process models from big data Alan Qi Purdue University Joint work with Z. Xu, F. Yan, B. Dai, and Y. Zhu Machine learning seminar at University of Cambridge, July 4 2012 Data A lot of

More information

A Survey of Document Clustering Techniques & Comparison of LDA and movmf

A Survey of Document Clustering Techniques & Comparison of LDA and movmf A Survey of Document Clustering Techniques & Comparison of LDA and movmf Yu Xiao December 10, 2010 Abstract This course project is mainly an overview of some widely used document clustering techniques.

More information

FINDING EXPERT USERS IN COMMUNITY QUESTION ANSWERING SERVICES USING TOPIC MODELS

FINDING EXPERT USERS IN COMMUNITY QUESTION ANSWERING SERVICES USING TOPIC MODELS FINDING EXPERT USERS IN COMMUNITY QUESTION ANSWERING SERVICES USING TOPIC MODELS by Fatemeh Riahi Submitted in partial fulfillment of the requirements for the degree of Master of Computer Science at Dalhousie

More information

Visipedia: collaborative harvesting and organization of visual knowledge

Visipedia: collaborative harvesting and organization of visual knowledge Visipedia: collaborative harvesting and organization of visual knowledge Pietro Perona California Institute of Technology 2014 Signal and Image Sciences Workshop Lawrence Livermore National Laboratory

More information

A Regime-Switching Relative Value Arbitrage Rule

A Regime-Switching Relative Value Arbitrage Rule A Regime-Switching Relative Value Arbitrage Rule Michael Bock and Roland Mestel University of Graz, Institute for Banking and Finance Universitaetsstrasse 15/F2, A-8010 Graz, Austria {michael.bock,roland.mestel}@uni-graz.at

More information

Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model

Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model Chaitanya Chemudugunta, Padhraic Smyth Department of Computer Science University of California, Irvine Irvine, CA 92697-3435,

More information

Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising

More information

Exponential Family Harmoniums with an Application to Information Retrieval

Exponential Family Harmoniums with an Application to Information Retrieval Exponential Family Harmoniums with an Application to Information Retrieval Max Welling & Michal Rosen-Zvi Information and Computer Science University of California Irvine CA 92697-3425 USA welling@ics.uci.edu

More information

Switching from PC SAS to SAS Enterprise Guide Zhengxin (Cindy) Yang, inventiv Health Clinical, Princeton, NJ

Switching from PC SAS to SAS Enterprise Guide Zhengxin (Cindy) Yang, inventiv Health Clinical, Princeton, NJ PharmaSUG 2014 PO10 Switching from PC SAS to SAS Enterprise Guide Zhengxin (Cindy) Yang, inventiv Health Clinical, Princeton, NJ ABSTRACT As more and more organizations adapt to the SAS Enterprise Guide,

More information

Distributed Computing and Big Data: Hadoop and MapReduce

Distributed Computing and Big Data: Hadoop and MapReduce Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:

More information

How To Cluster

How To Cluster Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main

More information

Big Data at Spotify. Anders Arpteg, Ph D Analytics Machine Learning, Spotify

Big Data at Spotify. Anders Arpteg, Ph D Analytics Machine Learning, Spotify Big Data at Spotify Anders Arpteg, Ph D Analytics Machine Learning, Spotify Quickly about me Quickly about Spotify What is all the data used for? Quickly about Spark Hadoop MR vs Spark Need for (distributed)

More information

Chapter 6: The Information Function 129. CHAPTER 7 Test Calibration

Chapter 6: The Information Function 129. CHAPTER 7 Test Calibration Chapter 6: The Information Function 129 CHAPTER 7 Test Calibration 130 Chapter 7: Test Calibration CHAPTER 7 Test Calibration For didactic purposes, all of the preceding chapters have assumed that the

More information

Project Report BIG-DATA CONTENT RETRIEVAL, STORAGE AND ANALYSIS FOUNDATIONS OF DATA-INTENSIVE COMPUTING. Masters in Computer Science

Project Report BIG-DATA CONTENT RETRIEVAL, STORAGE AND ANALYSIS FOUNDATIONS OF DATA-INTENSIVE COMPUTING. Masters in Computer Science Data Intensive Computing CSE 486/586 Project Report BIG-DATA CONTENT RETRIEVAL, STORAGE AND ANALYSIS FOUNDATIONS OF DATA-INTENSIVE COMPUTING Masters in Computer Science University at Buffalo Website: http://www.acsu.buffalo.edu/~mjalimin/

More information

Large-scale Data Mining: MapReduce and Beyond Part 2: Algorithms. Spiros Papadimitriou, IBM Research Jimeng Sun, IBM Research Rong Yan, Facebook

Large-scale Data Mining: MapReduce and Beyond Part 2: Algorithms. Spiros Papadimitriou, IBM Research Jimeng Sun, IBM Research Rong Yan, Facebook Large-scale Data Mining: MapReduce and Beyond Part 2: Algorithms Spiros Papadimitriou, IBM Research Jimeng Sun, IBM Research Rong Yan, Facebook Part 2:Mining using MapReduce Mining algorithms using MapReduce

More information

This can dilute the significance of a departure from the null hypothesis. We can focus the test on departures of a particular form.

This can dilute the significance of a departure from the null hypothesis. We can focus the test on departures of a particular form. One-Degree-of-Freedom Tests Test for group occasion interactions has (number of groups 1) number of occasions 1) degrees of freedom. This can dilute the significance of a departure from the null hypothesis.

More information

Search Engines. Stephen Shaw <stesh@netsoc.tcd.ie> 18th of February, 2014. Netsoc

Search Engines. Stephen Shaw <stesh@netsoc.tcd.ie> 18th of February, 2014. Netsoc Search Engines Stephen Shaw Netsoc 18th of February, 2014 Me M.Sc. Artificial Intelligence, University of Edinburgh Would recommend B.A. (Mod.) Computer Science, Linguistics, French,

More information

Self-Adjusting Condition Monitoring of PV Power Plant Production

Self-Adjusting Condition Monitoring of PV Power Plant Production Self-Adjusting Condition Monitoring of PV Power Plant Production Mining Measurement Data Dipl.-Inf. Steffen Dienst M.Sc. Soheila Sahami M.Sc. Johannes Schmidt {sdienst sahami}@informatik.uni-leipzig.de

More information

Chenfeng Xiong (corresponding), University of Maryland, College Park (cxiong@umd.edu)

Chenfeng Xiong (corresponding), University of Maryland, College Park (cxiong@umd.edu) Paper Author (s) Chenfeng Xiong (corresponding), University of Maryland, College Park (cxiong@umd.edu) Lei Zhang, University of Maryland, College Park (lei@umd.edu) Paper Title & Number Dynamic Travel

More information

The Basics of Graphical Models

The Basics of Graphical Models The Basics of Graphical Models David M. Blei Columbia University October 3, 2015 Introduction These notes follow Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan. Many figures

More information

Completing the Big Data Ecosystem:

Completing the Big Data Ecosystem: Completing the Big Data Ecosystem: in sqrrl data INC. August 3, 2012 Design Drivers in Analysis of big data is central to our customers requirements, in which the strongest drivers are: Scalability: The

More information

Continuous-Time Converter Architectures for Integrated Audio Processors: By Brian Trotter, Cirrus Logic, Inc. September 2008

Continuous-Time Converter Architectures for Integrated Audio Processors: By Brian Trotter, Cirrus Logic, Inc. September 2008 Continuous-Time Converter Architectures for Integrated Audio Processors: By Brian Trotter, Cirrus Logic, Inc. September 2008 As consumer electronics devices continue to both decrease in size and increase

More information

Big Data Analytics. Lucas Rego Drumond

Big Data Analytics. Lucas Rego Drumond Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany MapReduce II MapReduce II 1 / 33 Outline 1. Introduction

More information

A semi-supervised Spam mail detector

A semi-supervised Spam mail detector A semi-supervised Spam mail detector Bernhard Pfahringer Department of Computer Science, University of Waikato, Hamilton, New Zealand Abstract. This document describes a novel semi-supervised approach

More information

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Data Mining Project Report. Document Clustering. Meryem Uzun-Per Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...

More information

Incorporating Window-Based Passage-Level Evidence in Document Retrieval

Incorporating Window-Based Passage-Level Evidence in Document Retrieval Incorporating -Based Passage-Level Evidence in Document Retrieval Wensi Xi, Richard Xu-Rong, Christopher S.G. Khoo Center for Advanced Information Systems School of Applied Science Nanyang Technological

More information

Evidence-based Software Process Recovery

Evidence-based Software Process Recovery Evidence-based Software Process Recovery by Abram Hindle A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Doctor of Philosophy in Computer Science

More information

Social Media Mining. Network Measures

Social Media Mining. Network Measures Klout Measures and Metrics 22 Why Do We Need Measures? Who are the central figures (influential individuals) in the network? What interaction patterns are common in friends? Who are the like-minded users

More information

Performance Metrics for Graph Mining Tasks

Performance Metrics for Graph Mining Tasks Performance Metrics for Graph Mining Tasks 1 Outline Introduction to Performance Metrics Supervised Learning Performance Metrics Unsupervised Learning Performance Metrics Optimizing Metrics Statistical

More information

Visualizing Software Projects in JavaScript

Visualizing Software Projects in JavaScript Visualizing Software Projects in JavaScript Tim Disney Abstract Visualization techniques have been used to help programmers deepen their understanding of large software projects. However the existing visualization

More information

USING RUST TO BUILD THE NEXT GENERATION WEB BROWSER

USING RUST TO BUILD THE NEXT GENERATION WEB BROWSER USING RUST TO BUILD THE NEXT GENERATION WEB BROWSER Lars Bergstrom Mozilla Research Mike Blumenkrantz Samsung R&D America Why a new web engine? Support new types of applications and new devices All modern

More information

Sheeba J.I1, Vivekanandan K2

Sheeba J.I1, Vivekanandan K2 IMPROVED UNSUPERVISED FRAMEWORK FOR SOLVING SYNONYM, HOMONYM, HYPONYMY & POLYSEMY PROBLEMS FROM EXTRACTED KEYWORDS AND IDENTIFY TOPICS IN MEETING TRANSCRIPTS Sheeba J.I1, Vivekanandan K2 1 Assistant Professor,sheeba@pec.edu

More information