Unsupervised Prediction of. Laura Dietz, Steffen Bickel, Tobias Scheffer
|
|
- Clemence Gray
- 5 years ago
- Views:
Transcription
1 Unsupervised Prediction of Citation Influences Laura Dietz, Steffen Bickel, Tobias Scheffer
2 Outline Motivation. Filter citation graphs. Predict strength of influence of links. Generative models. Copycat model. Citation influence model. Experiments. Predictive performance based on labeled data. Conclusion & future work. Laura Diet tz, Steffen Bickel, To obias Sch effer 2
3 Read on a New Topic Seed publications. Infer citation strengths. Add citation vicinity. Filter edges. Add cited publications. Filter isolated nodes. Different reasons for citing: Approaches extended by citing paper. Baselines / papers tackling the same problem. Basic / background literature. Related work to be argued against. Because everybody cites this. 3 Laura Diet tz, Steffen Bickel, To obias Sch effer
4 Read on a New Topic: Filter & Layout Laura Dietz, Steffen Bickel, Tobias Scheffer 4
5 Predict Citation Influence Problem: Given citation graph and text for documents. How strong is the influence of a cited publication? No strength labels unsupervised. Idea: LDA like generative model. Capture interactions between publications. Associate words in the publication to citations. If many words are associated to one citation. This citation had a strong influence. Laura Diet tz, Steffen Bickel, To obias Sch effer 5
6 Generative Process: Copycat Latent Dirichlet Allocation L1 dirichlet prior dirichlet prior Laura Diet 1 1 _ _ 1 2 _ 2 3 _ 3 3 _ 3 3 _ 3 3 citing doc topic mixture of cited pub topic mix of associtated cited doc c strength of citation influences cited doc associated with word topic for word in t dirichlet prior topic for word in t cited doc citing doc c tz, Steffen Bickel, To obias Sch cited doc word in cited doc w word distr for each topic word in citing doc w effer All topics are inherited from cites. for each token... for each cited doc... for each topic... for each token... for each citing doc... 6
7 Generative Process: Copycat L1 dirichlet prior dirichlet prior Laura Diet 1 1 _ _ 1 2 _ 2 3 _ 3 3 _ 3 3 _ _ 3 _ 3 topic mixture of cited pub topic mix of associtated cited doc c strength of citation influences cited doc associated with word topic for word in t dirichlet prior topic for word in t cited doc citing doc c tz, Steffen Bickel, To obias Sch 3 3 word in cited doc w word distr for each topic word in citing doc w effer for each token... for each cited doc... for each topic... for each token... for each citing doc... 7
8 Properties of the Copycat Model Tight coupling between: Citing and cited publications. Bibliographically coupled publications. Issue: all words have to be associated to a citation. Introduce noise in the shared topic mixture. Influence association process of other citing publications. Solution: Model may decide to not associate some words. Laura Diet tz, Steffen Bickel, To obias Sch effer 8
9 Generative Process: Citation Influence L1 dirichlet prior dirichlet prior Laura Diet 1 1 _ _ _ 3 3 topic mixture of cited pub topic for word in cited doc t topic mix of associtated cited doc c dirichlet prior strength of citation influences cited doc associated with word c topic for word in citing doc t tz, Steffen Bickel, To obias Sch Flip unfair coin for word. If inherit: draw topic from cites (as in copycat). word in cited doc w for each token... for each cited doc... If innovate: draw from own topic mix. word distr for each topic for each topic... word in citing doc w for each token... for each citing doc... 9 effer
10 Collapsed Gibbs Sampler Joint distribution for citation influence model: Derive collapsed Gibbs sampler for both models. Convergence monitoring (5 chains). [Brooks & Gelman 98] Laura Diet tz, Steffen Bickel, Tobias Sch effer 10
11 Experiments: Predictive Performance 26 publications from CiteSeer 04 data set. Labeled by experts on Likert scale: ++, +, -, --. Corpus = labeled publications + 1 level of citations (total: 179 docs). Laura Diet tz, Steffen Bickel, Tobias Scheffer 11
12 Experiments: Baseline Approaches LDA-JS: Similarity of topic mixtures by Jensen Shannon div. γ(c d) ) exp( -DJS(( θd θc )). LDA-post: Use Bayes` rule / chain rule to convert p(t c) to p(c t). γ(c d) Σt p(t d) p(c t). TF-IDF: Cosine TF-IDF similarity. γ(c d) cos ( TF-IDF(d),TF-IDF(c) ). PageRank: γ(c d) PageRank(c). (Graph with 3 citation levels). Laura Diet tz, Steffen Bickel, To obias Sch effer 12
13 Experiments: Evaluation Measure For each single publication d: Citation strength γd is a decision function. Area under the ROC curve (AUC): ++ vs. +, -, , + vs. -, , +, - vs. --. Average AUC values for each d. Corpus-wide evaluation measure: Mean and std. error for averaged AUCs. Paired-t-tests. Laura Diet tz, Steffen Bickel, To obias Sch effer 13
14 Experiments: Predictive Performance AUC Citation Influence Copycat LDA-JS LDA-post Number of Topics Citation influence best method on average. Laura Diet tz, Steffen Bickel, To obias Sch effer 14
15 Experiments: Predictive Performance AUC Citation Influence Copycat LDA-JS LDA-post Number of Topics Citation influence best method on average. Significantly better than LDA (paired-t-test, α=5%). Laura Diet tz, Steffen Bickel, To obias Sch effer 15
16 Experiments: Predictive Performance AUC Citation Influence Copycat LDA-JS LDA-post Number of Topics Citation influence best method on average. Significantly better than LDA (paired-t-test, α=5%). Unsensitive to number of topics. Laura Diet tz, Steffen Bickel, To obias Sch effer 16
17 Experiments: Predictive Performance AUC Citation Influence Copycat LDA-JS LDA-post Number of Topics Citation influence best method on average. Significantly better than LDA (paired-t-test, α=5%). Unsensitive to number of topics. TF-IDF and PageRank do not work (AUC 0.5). 05) Laura Diet tz, Steffen Bickel, To obias Sch effer 17
18 Experiments: Convergence Citation influence converges faster than copycat. Po otential Scale Reduction Facto or Convergence of CI (blue) vs. CC (orange) Iterations Runtime: Citation influence 46 min, copycat 3h. Laura Diet tz, Steffen Bickel, To obias Sch effer 18
19 Laura Dietz, Steffen Bickel, Tobias Scheffer 19 Narrative Evaluation: Visualization
20 Narrative Evaluation: Analyze Abstract Association of words in abstract to cites (p(w,c d)). Example: research paper Latent Dirichlet Allocation. Laura Diet tz, Steffen Bickel, Tobias Scheffer 20
21 Conclusion & Future Work Two generative models for link strength prediction. Prediction: Citation influence better than LDA. Citation influence stable according to number of topics. Runtime: Citation influence faster than copycat. Useful for filtering citation graphs. Future work: Influence strengths in social networks and web. Better authority ranking with link strengths. Laura Diet tz, Steffen Bickel, To obias Sch effer 21
Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University caizhua@gmail.com
Bayesian Machine Learning (ML): Modeling And Inference in Big Data Zhuhua Cai Google Rice University caizhua@gmail.com 1 Syllabus Bayesian ML Concepts (Today) Bayesian ML on MapReduce (Next morning) Bayesian
More informationThe Hadoop Implementation. Thomas Zimmermann Philipp Berger
Link Analysis goes MapReduce The Hadoop Implementation Thomas Zimmermann Philipp Berger Flashback 2 Overview 3 1. Pre- / Postprocessing 2. Our Jobs 3. Evaluation Overview 4 1. Pre- / Postprocessing 2.
More informationTopic models for Sentiment analysis: A Literature Survey
Topic models for Sentiment analysis: A Literature Survey Nikhilkumar Jadhav 123050033 June 26, 2014 In this report, we present the work done so far in the field of sentiment analysis using topic models.
More informationFeature Selection with Monte-Carlo Tree Search
Feature Selection with Monte-Carlo Tree Search Robert Pinsler 20.01.2015 20.01.2015 Fachbereich Informatik DKE: Seminar zu maschinellem Lernen Robert Pinsler 1 Agenda 1 Feature Selection 2 Feature Selection
More informationExtracting Information from Social Networks
Extracting Information from Social Networks Aggregating site information to get trends 1 Not limited to social networks Examples Google search logs: flu outbreaks We Feel Fine Bullying 2 Bullying Xu, Jun,
More informationChapter ML:XI (continued)
Chapter ML:XI (continued) XI. Cluster Analysis Data Mining Overview Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained
More informationCourse: Model, Learning, and Inference: Lecture 5
Course: Model, Learning, and Inference: Lecture 5 Alan Yuille Department of Statistics, UCLA Los Angeles, CA 90095 yuille@stat.ucla.edu Abstract Probability distributions on structured representation.
More informationUnderstanding LDA in Source Code Analysis
Understanding LDA in Source Code Analysis David Binkley Daniel Heinz Dawn Lawrie Justin Overfelt Computer Science Department, Loyola University Maryland, Baltimore, MD, USA Department of Mathematics and
More informationDocument Classification with Latent Dirichlet Allocation
Document Classification with Latent Dirichlet Allocation Ph.D. Thesis Summary István Bíró Supervisor: András Lukács Ph.D. Eötvös Loránd University Faculty of Informatics Department of Information Sciences
More informationSMART Electronic Legal Discovery via Topic Modeling
Proceedings of the Twenty-Seventh International Florida Artificial Intelligence Research Society Conference SMART Electronic Legal Discovery via Topic Modeling Clint P. George, Sahil Puri, Daisy Zhe Wang,
More informationExample: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C
More informationProbabilistic topic models for sentiment analysis on the Web
University of Exeter Department of Computer Science Probabilistic topic models for sentiment analysis on the Web Chenghua Lin September 2011 Submitted by Chenghua Lin, to the the University of Exeter as
More informationMining Topics in Documents Standing on the Shoulders of Big Data. Zhiyuan (Brett) Chen and Bing Liu
Mining Topics in Documents Standing on the Shoulders of Big Data Zhiyuan (Brett) Chen and Bing Liu Topic Models Widely used in many applications Most of them are unsupervised However, topic models Require
More informationMHI3000 Big Data Analytics for Health Care Final Project Report
MHI3000 Big Data Analytics for Health Care Final Project Report Zhongtian Fred Qiu (1002274530) http://gallery.azureml.net/details/81ddb2ab137046d4925584b5095ec7aa 1. Data pre-processing The data given
More informationNetwork Big Data: Facing and Tackling the Complexities Xiaolong Jin
Network Big Data: Facing and Tackling the Complexities Xiaolong Jin CAS Key Laboratory of Network Data Science & Technology Institute of Computing Technology Chinese Academy of Sciences (CAS) 2015-08-10
More informationWhich universities lead and lag? Toward university rankings based on scholarly output
Which universities lead and lag? Toward university rankings based on scholarly output Daniel Ramage and Christopher D. Manning Computer Science Department Stanford University Stanford, California 94305
More informationIdentifying Focus, Techniques and Domain of Scientific Papers
Identifying Focus, Techniques and Domain of Scientific Papers Sonal Gupta Department of Computer Science Stanford University Stanford, CA 94305 sonal@cs.stanford.edu Christopher D. Manning Department of
More informationSemi-Supervised Support Vector Machines and Application to Spam Filtering
Semi-Supervised Support Vector Machines and Application to Spam Filtering Alexander Zien Empirical Inference Department, Bernhard Schölkopf Max Planck Institute for Biological Cybernetics ECML 2006 Discovery
More informationIntroduction to Markov Chain Monte Carlo
Introduction to Markov Chain Monte Carlo Monte Carlo: sample from a distribution to estimate the distribution to compute max, mean Markov Chain Monte Carlo: sampling using local information Generic problem
More informationMICROSOFT ACCESS 2010 COURSES. Course Title: Access 2010 Level 1
SORTED IT Training Courses Course materials and tutors provided by BUCS, bookings taken through Sorted, courses last 1.5hours and run on a rolling basis each week on Wednesday afternoons and Thursday evenings.
More informationPart 1: Link Analysis & Page Rank
Chapter 8: Graph Data Part 1: Link Analysis & Page Rank Based on Leskovec, Rajaraman, Ullman 214: Mining of Massive Datasets 1 Exam on the 5th of February, 216, 14. to 16. If you wish to attend, please
More informationUsing Data Fusion and Web Mining to Support Feature Location in Software
Using Data Fusion and Web Mining to Support Feature Location in Software Meghan Revelle, Bogdan Dit, and Denys Poshyvanyk Department of Computer Science The College of William and Mary Williamsburg, Virginia,
More informationHES-SO Master of Science in Engineering. Clustering. Prof. Laura Elena Raileanu HES-SO Yverdon-les-Bains (HEIG-VD)
HES-SO Master of Science in Engineering Clustering Prof. Laura Elena Raileanu HES-SO Yverdon-les-Bains (HEIG-VD) Plan Motivation Hierarchical Clustering K-Means Clustering 2 Problem Setup Arrange items
More informationThe Pennsylvania State University The Graduate School MINING SOCIAL DOCUMENTS AND NETWORKS. A Thesis in Computer Science and Engineering by Ding Zhou
The Pennsylvania State University The Graduate School MINING SOCIAL DOCUMENTS AND NETWORKS A Thesis in Computer Science and Engineering by Ding Zhou c 2008 Ding Zhou Submitted in Partial Fulfillment of
More informationExploratory Facebook Social Network Analysis
Exploratory Facebook Social Network Analysis with Gephi Thomas Plotkowiak twitterresearcher.wordpress.com Process 1. Import Data with Netviz 2. Gephi 1. Open 2. Layout 3. Ranking (Degree) 4. Statistics
More informationPhotolink- Fiber Optic Receiver PLR135/T1
Features High PD sensitivity optimized for red light Data : NRZ signal Low power consumption for extended battery life Built-in threshold control for improved noise Margin The product itself will remain
More informationBayesian Statistics: Indian Buffet Process
Bayesian Statistics: Indian Buffet Process Ilker Yildirim Department of Brain and Cognitive Sciences University of Rochester Rochester, NY 14627 August 2012 Reference: Most of the material in this note
More informationMACHINE LEARNING IN HIGH ENERGY PHYSICS
MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!
More information1 o Semestre 2007/2008
Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008 Outline 1 2 3 4 5 Outline 1 2 3 4 5 Exploiting Text How is text exploited? Two main directions Extraction Extraction
More informationSocial Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
More informationGibbs Sampling and Online Learning Introduction
Statistical Techniques in Robotics (16-831, F14) Lecture#10(Tuesday, September 30) Gibbs Sampling and Online Learning Introduction Lecturer: Drew Bagnell Scribes: {Shichao Yang} 1 1 Sampling Samples are
More informationRA MODEL VISUALIZATION WITH MICROSOFT EXCEL 2013 AND GEPHI
RA MODEL VISUALIZATION WITH MICROSOFT EXCEL 2013 AND GEPHI Prepared for Prof. Martin Zwick December 9, 2014 by Teresa D. Schmidt (tds@pdx.edu) 1. DOWNLOADING AND INSTALLING USER DEFINED SPLIT FUNCTION
More informationSWIFT: A Text-mining Workbench for Systematic Review
SWIFT: A Text-mining Workbench for Systematic Review Ruchir Shah, PhD Sciome LLC NTP Board of Scientific Counselors Meeting June 16, 2015 Large Literature Corpus: An Ever Increasing Challenge Systematic
More informationData Mining Yelp Data - Predicting rating stars from review text
Data Mining Yelp Data - Predicting rating stars from review text Rakesh Chada Stony Brook University rchada@cs.stonybrook.edu Chetan Naik Stony Brook University cnaik@cs.stonybrook.edu ABSTRACT The majority
More informationAcross-Model Collective Ensemble Classification
Across-Model Collective Ensemble Classification Hoda Eldardiry and Jennifer Neville Computer Science Department Purdue University West Lafayette, IN 47907 (hdardiry neville)@cs.purdue.edu Abstract Ensemble
More informationA Bayesian Approach to Discovering Truth from Conflicting Sources for Data Integration
A Bayesian Approach to Discovering Truth from Conflicting Sources for Data Integration Bo Zhao 1 Benjamin I. P. Rubinstein 2 Jim Gemmell 2 Jiawei Han 1 1 Department of Computer Science, University of Illinois,
More informationCS224W Project Report: Finding Top UI/UX Design Talent on Adobe Behance
CS224W Project Report: Finding Top UI/UX Design Talent on Adobe Behance Susanne Halstead, Daniel Serrano, Scott Proctor 6 December 2014 1 Abstract The Behance social network allows professionals of diverse
More informationGephi Tutorial Quick Start
Gephi Tutorial Welcome to this introduction tutorial. It will guide you to the basic steps of network visualization and manipulation in Gephi. Gephi version 0.7alpha2 was used to do this tutorial. Get
More informationFelix Aronsson. Master s Thesis. Department of Electrical and Information Technology, Faculty of Engineering, LTH, Lund University, February 2015.
Master s Thesis Large scale cluster analysis with Hadoop and Mahout Felix Aronsson Department of Electrical and Information Technology, Faculty of Engineering, LTH, Lund University, February 2015. Large
More informationPredicting the Stock Market with News Articles
Predicting the Stock Market with News Articles Kari Lee and Ryan Timmons CS224N Final Project Introduction Stock market prediction is an area of extreme importance to an entire industry. Stock price is
More informationClustering Technique in Data Mining for Text Documents
Clustering Technique in Data Mining for Text Documents Ms.J.Sathya Priya Assistant Professor Dept Of Information Technology. Velammal Engineering College. Chennai. Ms.S.Priyadharshini Assistant Professor
More informationCSV886: Social, Economics and Business Networks. Lecture 2: Affiliation and Balance. R Ravi ravi+iitd@andrew.cmu.edu
CSV886: Social, Economics and Business Networks Lecture 2: Affiliation and Balance R Ravi ravi+iitd@andrew.cmu.edu Granovetter s Puzzle Resolved Strong Triadic Closure holds in most nodes in social networks
More informationJure Leskovec (@jure) Stanford University
Jure Leskovec (@jure) Stanford University KDD Summer School, Beijing, August 2012 8/10/2012 Jure Leskovec (@jure), KDD Summer School 2012 2 Graph: Kronecker graphs Graph Node attributes: MAG model Graph
More informationSupervised Learning Evaluation (via Sentiment Analysis)!
Supervised Learning Evaluation (via Sentiment Analysis)! Why Analyze Sentiment? Sentiment Analysis (Opinion Mining) Automatically label documents with their sentiment Toward a topic Aggregated over documents
More informationCharacterization of Latent Social Networks Discovered through Computer Network Logs
Characterization of Latent Social Networks Discovered through Computer Network Logs Kevin M. Carter MIT Lincoln Laboratory 244 Wood St Lexington, MA 02420 kevin.carter@ll.mit.edu Rajmonda S. Caceres MIT
More informationTopical Authority Identification in Community Question Answering
Topical Authority Identification in Community Question Answering Guangyou Zhou, Kang Liu, and Jun Zhao National Laboratory of Pattern Recognition Institute of Automation, Chinese Academy of Sciences 95
More informationAnti-Spam Filter Based on Naïve Bayes, SVM, and KNN model
AI TERM PROJECT GROUP 14 1 Anti-Spam Filter Based on,, and model Yun-Nung Chen, Che-An Lu, Chao-Yu Huang Abstract spam email filters are a well-known and powerful type of filters. We construct different
More informationThe NIH Visual Browser: An Interactive Visualization of Biomedical Research
The NIH Visual Browser: An Interactive Visualization of Biomedical Research Bruce W. Herr II, Edmund M. Talley, Gully A.P.C. Burns, David Newman, Gavin LaRowe ChalkLabs, National Institute of Neurological
More information8. Machine Learning Applied Artificial Intelligence
8. Machine Learning Applied Artificial Intelligence Prof. Dr. Bernhard Humm Faculty of Computer Science Hochschule Darmstadt University of Applied Sciences 1 Retrospective Natural Language Processing Name
More informationGaussian Processes to Speed up Hamiltonian Monte Carlo
Gaussian Processes to Speed up Hamiltonian Monte Carlo Matthieu Lê Murray, Iain http://videolectures.net/mlss09uk_murray_mcmc/ Rasmussen, Carl Edward. "Gaussian processes to speed up hybrid Monte Carlo
More informationMachine Learning for Display Advertising
Machine Learning for Display Advertising The Idea: Social Targeting for Online Advertising Performance Current ad spending seems disproportionate Online Advertising Spending Breakdown (2009) Source: IAB
More informationGraphs over Time Densification Laws, Shrinking Diameters and Possible Explanations
Graphs over Time Densification Laws, Shrinking Diameters and Possible Explanations Jurij Leskovec, CMU Jon Kleinberg, Cornell Christos Faloutsos, CMU 1 Introduction What can we do with graphs? What patterns
More informationLeveraging Careful Microblog Users for Spammer Detection
Leveraging Careful Microblog Users for Spammer Detection Hao Fu University of Science and Technology of China fuch@mail.ustc.edu.cn Xing Xie Microsoft Research xingx@microsoft.com Yong Rui Microsoft Research
More informationThe PageRank Citation Ranking: Bring Order to the Web
The PageRank Citation Ranking: Bring Order to the Web presented by: Xiaoxi Pang 25.Nov 2010 1 / 20 Outline Introduction A ranking for every page on the Web Implementation Convergence Properties Personalized
More informationBayesian networks - Time-series models - Apache Spark & Scala
Bayesian networks - Time-series models - Apache Spark & Scala Dr John Sandiford, CTO Bayes Server Data Science London Meetup - November 2014 1 Contents Introduction Bayesian networks Latent variables Anomaly
More informationLarge-Scale Data Sets Clustering Based on MapReduce and Hadoop
Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE
More informationMining Topics in Documents: Standing on the Shoulders of Big Data
Mining Topics in Documents: Standing on the Shoulders of Big Data Zhiyuan Chen and Bing Liu Department of Computer Science University of Illinois at Chicago czyuanacm@gmail.com, liub@cs.uic.edu ABSTRACT
More informationBetter credit models benefit us all
Better credit models benefit us all Agenda Credit Scoring - Overview Random Forest - Overview Random Forest outperform logistic regression for credit scoring out of the box Interaction term hypothesis
More informationCS 207 - Data Science and Visualization Spring 2016
CS 207 - Data Science and Visualization Spring 2016 Professor: Sorelle Friedler sorelle@cs.haverford.edu An introduction to techniques for the automated and human-assisted analysis of data sets. These
More informationSterile E Online NetworkTutorials
3M Sterilization Assurance Products EO Blue bar must enter area Sterile E Online NetworkTutorials EO Blue bar must enter area Understanding the Differences Between Class 5 s and Class 6 Emulating Indicators
More informationMachine Learning for Data Science (CS4786) Lecture 1
Machine Learning for Data Science (CS4786) Lecture 1 Tu-Th 10:10 to 11:25 AM Hollister B14 Instructors : Lillian Lee and Karthik Sridharan ROUGH DETAILS ABOUT THE COURSE Diagnostic assignment 0 is out:
More informationMultiple Location Profiling for Users and Relationships from Social Network and Content
Multiple Location Profiling for Users and Relationships from Social Network and Content Rui Li ruili1@illinois.edu Shengjie Wang wang260@illinois.edu Kevin Chen-Chuan Chang, kcchang@illinois.edu Department
More informationBayes and Naïve Bayes. cs534-machine Learning
Bayes and aïve Bayes cs534-machine Learning Bayes Classifier Generative model learns Prediction is made by and where This is often referred to as the Bayes Classifier, because of the use of the Bayes rule
More information1 Maximum likelihood estimation
COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N
More informationA neo4j powered social networking and Question & Answer application to enhance scientific communication. René Pickhardt, Heinrich Hartmann
A neo4j powered social networking and Question & Answer application to enhance scientific communication. René Pickhardt, Heinrich Hartmann related-work.net Roadmap Introduction Data structures for Q &
More informationTowards running complex models on big data
Towards running complex models on big data Working with all the genomes in the world without changing the model (too much) Daniel Lawson Heilbronn Institute, University of Bristol 2013 1 / 17 Motivation
More informationLearning Gaussian process models from big data. Alan Qi Purdue University Joint work with Z. Xu, F. Yan, B. Dai, and Y. Zhu
Learning Gaussian process models from big data Alan Qi Purdue University Joint work with Z. Xu, F. Yan, B. Dai, and Y. Zhu Machine learning seminar at University of Cambridge, July 4 2012 Data A lot of
More informationA Survey of Document Clustering Techniques & Comparison of LDA and movmf
A Survey of Document Clustering Techniques & Comparison of LDA and movmf Yu Xiao December 10, 2010 Abstract This course project is mainly an overview of some widely used document clustering techniques.
More informationFINDING EXPERT USERS IN COMMUNITY QUESTION ANSWERING SERVICES USING TOPIC MODELS
FINDING EXPERT USERS IN COMMUNITY QUESTION ANSWERING SERVICES USING TOPIC MODELS by Fatemeh Riahi Submitted in partial fulfillment of the requirements for the degree of Master of Computer Science at Dalhousie
More informationVisipedia: collaborative harvesting and organization of visual knowledge
Visipedia: collaborative harvesting and organization of visual knowledge Pietro Perona California Institute of Technology 2014 Signal and Image Sciences Workshop Lawrence Livermore National Laboratory
More informationA Regime-Switching Relative Value Arbitrage Rule
A Regime-Switching Relative Value Arbitrage Rule Michael Bock and Roland Mestel University of Graz, Institute for Banking and Finance Universitaetsstrasse 15/F2, A-8010 Graz, Austria {michael.bock,roland.mestel}@uni-graz.at
More informationModeling General and Specific Aspects of Documents with a Probabilistic Topic Model
Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model Chaitanya Chemudugunta, Padhraic Smyth Department of Computer Science University of California, Irvine Irvine, CA 92697-3435,
More informationLogistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression
Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max
More informationBig Data Analytics CSCI 4030
High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising
More informationExponential Family Harmoniums with an Application to Information Retrieval
Exponential Family Harmoniums with an Application to Information Retrieval Max Welling & Michal Rosen-Zvi Information and Computer Science University of California Irvine CA 92697-3425 USA welling@ics.uci.edu
More informationSwitching from PC SAS to SAS Enterprise Guide Zhengxin (Cindy) Yang, inventiv Health Clinical, Princeton, NJ
PharmaSUG 2014 PO10 Switching from PC SAS to SAS Enterprise Guide Zhengxin (Cindy) Yang, inventiv Health Clinical, Princeton, NJ ABSTRACT As more and more organizations adapt to the SAS Enterprise Guide,
More informationDistributed Computing and Big Data: Hadoop and MapReduce
Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:
More informationHow To Cluster
Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main
More informationBig Data at Spotify. Anders Arpteg, Ph D Analytics Machine Learning, Spotify
Big Data at Spotify Anders Arpteg, Ph D Analytics Machine Learning, Spotify Quickly about me Quickly about Spotify What is all the data used for? Quickly about Spark Hadoop MR vs Spark Need for (distributed)
More informationChapter 6: The Information Function 129. CHAPTER 7 Test Calibration
Chapter 6: The Information Function 129 CHAPTER 7 Test Calibration 130 Chapter 7: Test Calibration CHAPTER 7 Test Calibration For didactic purposes, all of the preceding chapters have assumed that the
More informationProject Report BIG-DATA CONTENT RETRIEVAL, STORAGE AND ANALYSIS FOUNDATIONS OF DATA-INTENSIVE COMPUTING. Masters in Computer Science
Data Intensive Computing CSE 486/586 Project Report BIG-DATA CONTENT RETRIEVAL, STORAGE AND ANALYSIS FOUNDATIONS OF DATA-INTENSIVE COMPUTING Masters in Computer Science University at Buffalo Website: http://www.acsu.buffalo.edu/~mjalimin/
More informationLarge-scale Data Mining: MapReduce and Beyond Part 2: Algorithms. Spiros Papadimitriou, IBM Research Jimeng Sun, IBM Research Rong Yan, Facebook
Large-scale Data Mining: MapReduce and Beyond Part 2: Algorithms Spiros Papadimitriou, IBM Research Jimeng Sun, IBM Research Rong Yan, Facebook Part 2:Mining using MapReduce Mining algorithms using MapReduce
More informationThis can dilute the significance of a departure from the null hypothesis. We can focus the test on departures of a particular form.
One-Degree-of-Freedom Tests Test for group occasion interactions has (number of groups 1) number of occasions 1) degrees of freedom. This can dilute the significance of a departure from the null hypothesis.
More informationSearch Engines. Stephen Shaw <stesh@netsoc.tcd.ie> 18th of February, 2014. Netsoc
Search Engines Stephen Shaw Netsoc 18th of February, 2014 Me M.Sc. Artificial Intelligence, University of Edinburgh Would recommend B.A. (Mod.) Computer Science, Linguistics, French,
More informationSelf-Adjusting Condition Monitoring of PV Power Plant Production
Self-Adjusting Condition Monitoring of PV Power Plant Production Mining Measurement Data Dipl.-Inf. Steffen Dienst M.Sc. Soheila Sahami M.Sc. Johannes Schmidt {sdienst sahami}@informatik.uni-leipzig.de
More informationChenfeng Xiong (corresponding), University of Maryland, College Park (cxiong@umd.edu)
Paper Author (s) Chenfeng Xiong (corresponding), University of Maryland, College Park (cxiong@umd.edu) Lei Zhang, University of Maryland, College Park (lei@umd.edu) Paper Title & Number Dynamic Travel
More informationThe Basics of Graphical Models
The Basics of Graphical Models David M. Blei Columbia University October 3, 2015 Introduction These notes follow Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan. Many figures
More informationCompleting the Big Data Ecosystem:
Completing the Big Data Ecosystem: in sqrrl data INC. August 3, 2012 Design Drivers in Analysis of big data is central to our customers requirements, in which the strongest drivers are: Scalability: The
More informationContinuous-Time Converter Architectures for Integrated Audio Processors: By Brian Trotter, Cirrus Logic, Inc. September 2008
Continuous-Time Converter Architectures for Integrated Audio Processors: By Brian Trotter, Cirrus Logic, Inc. September 2008 As consumer electronics devices continue to both decrease in size and increase
More informationBig Data Analytics. Lucas Rego Drumond
Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany MapReduce II MapReduce II 1 / 33 Outline 1. Introduction
More informationA semi-supervised Spam mail detector
A semi-supervised Spam mail detector Bernhard Pfahringer Department of Computer Science, University of Waikato, Hamilton, New Zealand Abstract. This document describes a novel semi-supervised approach
More informationData Mining Project Report. Document Clustering. Meryem Uzun-Per
Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...
More informationIncorporating Window-Based Passage-Level Evidence in Document Retrieval
Incorporating -Based Passage-Level Evidence in Document Retrieval Wensi Xi, Richard Xu-Rong, Christopher S.G. Khoo Center for Advanced Information Systems School of Applied Science Nanyang Technological
More informationEvidence-based Software Process Recovery
Evidence-based Software Process Recovery by Abram Hindle A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Doctor of Philosophy in Computer Science
More informationSocial Media Mining. Network Measures
Klout Measures and Metrics 22 Why Do We Need Measures? Who are the central figures (influential individuals) in the network? What interaction patterns are common in friends? Who are the like-minded users
More informationPerformance Metrics for Graph Mining Tasks
Performance Metrics for Graph Mining Tasks 1 Outline Introduction to Performance Metrics Supervised Learning Performance Metrics Unsupervised Learning Performance Metrics Optimizing Metrics Statistical
More informationVisualizing Software Projects in JavaScript
Visualizing Software Projects in JavaScript Tim Disney Abstract Visualization techniques have been used to help programmers deepen their understanding of large software projects. However the existing visualization
More informationUSING RUST TO BUILD THE NEXT GENERATION WEB BROWSER
USING RUST TO BUILD THE NEXT GENERATION WEB BROWSER Lars Bergstrom Mozilla Research Mike Blumenkrantz Samsung R&D America Why a new web engine? Support new types of applications and new devices All modern
More informationSheeba J.I1, Vivekanandan K2
IMPROVED UNSUPERVISED FRAMEWORK FOR SOLVING SYNONYM, HOMONYM, HYPONYMY & POLYSEMY PROBLEMS FROM EXTRACTED KEYWORDS AND IDENTIFY TOPICS IN MEETING TRANSCRIPTS Sheeba J.I1, Vivekanandan K2 1 Assistant Professor,sheeba@pec.edu
More information