Inference Methods for Analyzing the Hidden Semantics in Big Data. Phuong LE-HONG

1 Inference Methods for Analyzing the Hidden Semantics in Big Data
Phuong LE-HONG, phuonglh@gmail.com

2 Introduction
- Grant proposal for a basic research project
- Nafosted, months
- Principal Investigator: KhoatTQ, SoICT, HUST
June 2014, Nafosted Proposal

3 Goal
Develop a class of inference algorithms that enable us to:
- explore and discover hidden structures (semantics) in massive text collections;
- make accurate predictions in practical applications.

4 Methodologies
Key directions in Distributed Processing and Machine Learning:
- Topic modeling (Blei, 2012)
- Matrix factorization (Lee & Seung, 1999)
- Online learning (Hazan & Kale, 2012)
- Stochastic inference (Hoffman et al., 2013)

5 Applications
Develop efficient methods for:
- Question answering
- Text and web mining
- Recommendation systems
- Social network analysis

6 Literature Review
Inferring hidden structures from data is an attractive research topic with many applications:
- Exploration of a century of scientific journals (Mimno, 2012; Blei & Lafferty, 2007)
- Exploration of a century of literature (Jockers & Mimno, 2013)
- Exploration of online forums/networks (Cao et al., 2011; Gerrish & Blei, 2012; Sun & Lin, 2013)
- Analyzing political opinions from online forums (Cao et al., 2011; Gerrish & Blei, 2012; Grimmer, 2010; Levy & Franklin, 2013)
- Analyzing behaviors and interests of online users (Gerrish & Blei, 2012; Sun & Lin, 2013; Wang et al., 2011)

7 Literature Review
Many approaches:
- Bayesian networks (Darwiche, 2010)
- Gaussian graphical models (Hsieh et al., 2013)
- Topic modeling (Hofmann, 2001; Blei, 2012)
- Non-negative matrix factorization (NMF) (Lee & Seung, 1999; Wang et al., 2011)
This project will use topic modeling and NMF as the main approaches for developing efficient methods to analyze big text collections.
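To make the NMF direction concrete, here is a minimal sketch of the standard multiplicative-update rules for the squared-error objective, factorizing a non-negative matrix V into W and H. It is an illustration only, not the method this project proposes: the toy matrix, the rank, the iteration count, and the helper names (`matmul`, `transpose`, `frob_error`, `nmf`) are all assumptions introduced here.

```python
import random


def matmul(a, b):
    """Multiply two dense matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def frob_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iters=200, seed=0, eps=1e-9):
    """Factorize V (n x m) into W (n x rank) and H (rank x m), all non-negative."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive random initialisation (an illustrative choice).
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    errors = [frob_error(v, w, h)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise.
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise.
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
        errors.append(frob_error(v, w, h))
    return w, h, errors


# Toy term-document count matrix (hypothetical data): two groups of terms.
v = [[3, 2, 0, 0],
     [2, 3, 0, 0],
     [0, 0, 4, 1],
     [0, 0, 1, 4]]
w, h, errors = nmf(v, rank=2)
print("initial error:", errors[0], "final error:", errors[-1])
```

Because the updates only multiply by non-negative ratios, W and H stay non-negative without any projection step, which is what makes the factors readable as additive "parts" or topics.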

8 Literature Review
- Inference for a document: estimation of the variables hidden in that document (topics, entities, entity relations)
- Inference for a dataset: learning of the hidden structures (topics, topical networks, social communities, user trends)
- Inference is NP-hard (Sontag & Roy, 2011)
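As an illustration of inference for a single document, the sketch below estimates a document's topic proportions by maximum likelihood with EM while keeping the topics fixed (a "folding-in" style procedure). It is a simple heuristic baseline, not the guaranteed method the proposal aims for. The function name `infer_topic_proportions`, the two toy topics, and the word counts are hypothetical.

```python
def infer_topic_proportions(word_counts, topics, iters=100):
    """Estimate topic proportions theta for one document.

    word_counts: dict mapping word id -> count in the document.
    topics: list of K lists; topics[k][w] is the probability of word w in topic k.
    """
    k = len(topics)
    theta = [1.0 / k] * k  # start from uniform proportions
    for _ in range(iters):
        expected = [0.0] * k
        for w, c in word_counts.items():
            # E-step: responsibility of each topic for word w.
            weights = [theta[t] * topics[t][w] for t in range(k)]
            z = sum(weights)
            if z == 0.0:
                continue  # word has zero probability under every topic
            for t in range(k):
                expected[t] += c * weights[t] / z
        # M-step: renormalise expected counts into proportions.
        s = sum(expected)
        if s == 0.0:
            break
        theta = [x / s for x in expected]
    return theta


# Two hypothetical topics over a four-word vocabulary.
topics = [[0.4, 0.4, 0.1, 0.1],
          [0.1, 0.1, 0.4, 0.4]]
# A toy document that mostly uses words 0 and 1.
theta = infer_topic_proportions({0: 3, 1: 2, 2: 1}, topics)
print(theta)
```

Each iteration never decreases the document likelihood, but EM in general offers no bound on how many iterations are needed or, for richer priors, on how close the result is to the true optimum. That gap is what the NP-hardness remark above refers to.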

9 Literature Review
Various methods for efficient inference have been proposed:
- Maximum likelihood estimation (ML) (Hofmann, 2001)
- Variational Bayesian (VB) (Blei et al., 2003)
- Collapsed variational Bayesian (CVB) (Asuncion et al., 2009)
- Collapsed Gibbs sampling (CGS) (Griffiths & Steyvers, 2004)
- Maximum a posteriori estimation (MAP) (Chien & Wu, 2008)

10 Literature Review
Some remarks:
- Sampling-based methods are guaranteed to converge to the underlying distributions, but at an unknown rate.
- VB and CVB are much faster.
- CVB0 (Asuncion et al., 2009) often performs the best.
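For readers unfamiliar with the sampling-based family, the following is a compact sketch of collapsed Gibbs sampling for LDA in the style of Griffiths & Steyvers (2004). It is a teaching example under stated assumptions: the hyperparameters `alpha` and `beta`, the number of sweeps, and the toy corpus are illustrative choices, and nothing here is tuned or validated.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
              sweeps=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # n_{d,k}
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # n_{k,w}
    topic_total = [0] * num_topics                              # n_k
    assign = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)
    for _ in range(sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Conditional probability of each topic given all other tokens.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                # Record the new assignment.
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


# Toy corpus (hypothetical): word ids 0-2 and 3-5 tend to co-occur.
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 1], [4, 5, 3, 5]]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2, vocab_size=6)
print(doc_topic)
```

The sampler touches every token in every sweep and gives no indication of how many sweeps are enough, which is the "unknown rate" caveat noted above.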

11 Literature Review
Over 20 years of development, yet many open problems remain.
Accuracy of inferring a model from data:
- Attacked by Arora et al. (2012), Arora et al. (2013), and Anandkumar et al. (2012), with breakthrough results;
- But those results are limited to some restricted models under certain conditions.
- A large class of topic models and NMF still lacks a theoretical guarantee.
- And those results do not cover inference for individual documents.

12 Literature Review
Previous work on processing big data collections:
- Focuses mainly on utilizing parallel/distributed architectures
- Works well with million-document collections
Two main limitations:
- LDA models are dense, which might consume huge memory when the domain dimension is very large;
- Existing methods for inferring individual documents have no theoretical guarantee on either inference quality or inference time.
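A back-of-the-envelope calculation shows why a dense topic model becomes a memory problem. The sketch below compares a dense topic-word matrix with a sparse one. The sizes (1,000 topics, a 1,000,000-term vocabulary, 1% non-zero entries) and the storage costs (8-byte values, 4-byte column indices) are assumptions chosen for illustration, not figures from the proposal.

```python
def dense_bytes(num_topics, vocab_size, bytes_per_value=8):
    """Memory for a dense topic-word matrix of 64-bit floats."""
    return num_topics * vocab_size * bytes_per_value


def sparse_bytes(num_topics, vocab_size, density,
                 bytes_per_value=8, bytes_per_index=4):
    """Rough memory for storing only non-zero entries (value + column index).

    Ignores per-row bookkeeping, so it slightly underestimates real formats.
    """
    nonzeros = round(num_topics * vocab_size * density)
    return nonzeros * (bytes_per_value + bytes_per_index)


GIB = 2 ** 30
dense = dense_bytes(1000, 1_000_000)
sparse = sparse_bytes(1000, 1_000_000, density=0.01)
print(f"dense:  {dense / GIB:.2f} GiB")   # about 7.45 GiB
print(f"sparse: {sparse / GIB:.2f} GiB")  # about 0.11 GiB
```

Under these assumed numbers the dense model alone needs several gigabytes before any data is loaded, while a model with sparse topics fits in a small fraction of that.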

13 Five Problems
- P1: Can we develop a fast inference method that has provable theoretical guarantees on quality?
- P2: How can we learn a big topic model from big data?
- P3: Can we develop methods with provable guarantees on quality for handling streaming/dynamic text collections?

14 Five Problems
- P4: Can we develop an optimized big data processing framework to handle massive distributed computations of inference methods?
- P5: How can the hidden semantics recovered by our inference methods be useful in fundamental problems of NLP and IR?
  - QA
  - Text and web mining
  - Recommendation

15 Three Groups
- Inference methods: TQ. Khoat, NK. Anh, NV. Linh (P1, P2, P3)
- Large-scale computation: TV. Trung, NB. Minh, TQ. Khoat (P3, P4)
- Applications: LH. Phuong, NV. Linh, NK. Anh, TQ. Khoat (P1, P5)

16 Expected Results
- A fast inference method that has a theoretical guarantee on quality and is general enough to be easily employed in a large class of statistical models
- A family of methods for analyzing the hidden structures/semantics in text collections and non-negative data
- A provably fast method that enables us to work with streaming/dynamic text collections and non-negative data

17 Expected Results
- A new theory that enables us to design fast algorithms for non-convex inference problems, which appear in a large class of probabilistic models
- New effective methods for practical applications such as question answering, text and web mining, recommendation, and social network analysis

18 Expected Results
Publications:
- Articles in ISI-covered journals: 2
- National/International conferences: 5
Training results:
- Masters: 2
- PhD: 3

Data Mining Yelp Data - Predicting rating stars from review text

Data Mining Yelp Data - Predicting rating stars from review text Data Mining Yelp Data - Predicting rating stars from review text Rakesh Chada Stony Brook University rchada@cs.stonybrook.edu Chetan Naik Stony Brook University cnaik@cs.stonybrook.edu ABSTRACT The majority

More information

Online Courses Recommendation based on LDA

Online Courses Recommendation based on LDA Online Courses Recommendation based on LDA Rel Guzman Apaza, Elizabeth Vera Cervantes, Laura Cruz Quispe, José Ochoa Luna National University of St. Agustin Arequipa - Perú {r.guzmanap,elizavvc,lvcruzq,eduardo.ol}@gmail.com

More information

Learning Gaussian process models from big data. Alan Qi Purdue University Joint work with Z. Xu, F. Yan, B. Dai, and Y. Zhu

Learning Gaussian process models from big data. Alan Qi Purdue University Joint work with Z. Xu, F. Yan, B. Dai, and Y. Zhu Learning Gaussian process models from big data Alan Qi Purdue University Joint work with Z. Xu, F. Yan, B. Dai, and Y. Zhu Machine learning seminar at University of Cambridge, July 4 2012 Data A lot of

More information

Prediction of Heart Disease Using Naïve Bayes Algorithm

Prediction of Heart Disease Using Naïve Bayes Algorithm Prediction of Heart Disease Using Naïve Bayes Algorithm R.Karthiyayini 1, S.Chithaara 2 Assistant Professor, Department of computer Applications, Anna University, BIT campus, Tiruchirapalli, Tamilnadu,

More information

Network Big Data: Facing and Tackling the Complexities Xiaolong Jin

Network Big Data: Facing and Tackling the Complexities Xiaolong Jin Network Big Data: Facing and Tackling the Complexities Xiaolong Jin CAS Key Laboratory of Network Data Science & Technology Institute of Computing Technology Chinese Academy of Sciences (CAS) 2015-08-10

More information

I, ONLINE KNOWLEDGE TRANSFER FREELANCER PORTAL M. & A.

I, ONLINE KNOWLEDGE TRANSFER FREELANCER PORTAL M. & A. ONLINE KNOWLEDGE TRANSFER FREELANCER PORTAL M. Micheal Fernandas* & A. Anitha** * PG Scholar, Department of Master of Computer Applications, Dhanalakshmi Srinivasan Engineering College, Perambalur, Tamilnadu

More information

Latent Dirichlet Markov Allocation for Sentiment Analysis

Latent Dirichlet Markov Allocation for Sentiment Analysis Latent Dirichlet Markov Allocation for Sentiment Analysis Ayoub Bagheri Isfahan University of Technology, Isfahan, Iran Intelligent Database, Data Mining and Bioinformatics Lab, Electrical and Computer

More information

Machine Learning and Statistics: What s the Connection?

Machine Learning and Statistics: What s the Connection? Machine Learning and Statistics: What s the Connection? Institute for Adaptive and Neural Computation School of Informatics, University of Edinburgh, UK August 2006 Outline The roots of machine learning

More information

Machine Learning over Big Data

Machine Learning over Big Data Machine Learning over Big Presented by Fuhao Zou fuhao@hust.edu.cn Jue 16, 2014 Huazhong University of Science and Technology Contents 1 2 3 4 Role of Machine learning Challenge of Big Analysis Distributed

More information

Multilingual Rules for Spam Detection

Multilingual Rules for Spam Detection Multilingual Rules for Spam Detection Minh Tuan Vu 1, Quang Anh Tran 1, Frank Jiang 2 and Van Quan Tran 1 1 Faculty of Information Technology, Hanoi University, Hanoi, Vietnam 2 School of Engineering and

More information

Topical Authority Identification in Community Question Answering

Topical Authority Identification in Community Question Answering Topical Authority Identification in Community Question Answering Guangyou Zhou, Kang Liu, and Jun Zhao National Laboratory of Pattern Recognition Institute of Automation, Chinese Academy of Sciences 95

More information

Decision Support System For A Customer Relationship Management Case Study

Decision Support System For A Customer Relationship Management Case Study 61 Decision Support System For A Customer Relationship Management Case Study Ozge Kart 1, Alp Kut 1, and Vladimir Radevski 2 1 Dokuz Eylul University, Izmir, Turkey {ozge, alp}@cs.deu.edu.tr 2 SEE University,

More information

When scientists decide to write a paper, one of the first

When scientists decide to write a paper, one of the first Colloquium Finding scientific topics Thomas L. Griffiths* and Mark Steyvers *Department of Psychology, Stanford University, Stanford, CA 94305; Department of Brain and Cognitive Sciences, Massachusetts

More information

CSCI-599 Advanced Big Data Analytics

CSCI-599 Advanced Big Data Analytics CSCI-599 Advanced Big Data Analytics 1. Basic Information Course: Advanced Data Analytics, CSCI-599 Place and time: TBA, Wed 2:00-4:40pm/ Fall Instructor: Yan Liu Assistant Professor of Computer Science

More information

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University caizhua@gmail.com

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University caizhua@gmail.com Bayesian Machine Learning (ML): Modeling And Inference in Big Data Zhuhua Cai Google Rice University caizhua@gmail.com 1 Syllabus Bayesian ML Concepts (Today) Bayesian ML on MapReduce (Next morning) Bayesian

More information

Learning outcomes. Knowledge and understanding. Competence and skills

Learning outcomes. Knowledge and understanding. Competence and skills Syllabus Master s Programme in Statistics and Data Mining 120 ECTS Credits Aim The rapid growth of databases provides scientists and business people with vast new resources. This programme meets the challenges

More information

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning SAMSI 10 May 2013 Outline Introduction to NMF Applications Motivations NMF as a middle step

More information

Bayesian Statistics: Indian Buffet Process

Bayesian Statistics: Indian Buffet Process Bayesian Statistics: Indian Buffet Process Ilker Yildirim Department of Brain and Cognitive Sciences University of Rochester Rochester, NY 14627 August 2012 Reference: Most of the material in this note

More information

Probabilistic. review articles. Surveying a suite of algorithms that offer a solution to managing large document archives.

Probabilistic. review articles. Surveying a suite of algorithms that offer a solution to managing large document archives. doi:10.1145/2133806.2133826 Surveying a suite of algorithms that offer a solution to managing large document archives. by David M. Blei Probabilistic Topic Models As our collective knowledge continues

More information

Learning outcomes. Knowledge and understanding. Ability and Competences. Evaluation capability and scientific approach

Learning outcomes. Knowledge and understanding. Ability and Competences. Evaluation capability and scientific approach Syllabus Master s Programme in Statistics and Data Mining 120 ECTS Credits Aim The rapid growth of databases provides scientists and business people with vast new resources. This programme meets the challenges

More information

01219211 Software Development Training Camp 1 (0-3) Prerequisite : 01204214 Program development skill enhancement camp, at least 48 person-hours.

01219211 Software Development Training Camp 1 (0-3) Prerequisite : 01204214 Program development skill enhancement camp, at least 48 person-hours. (International Program) 01219141 Object-Oriented Modeling and Programming 3 (3-0) Object concepts, object-oriented design and analysis, object-oriented analysis relating to developing conceptual models

More information

Tensor Methods for Machine Learning, Computer Vision, and Computer Graphics

Tensor Methods for Machine Learning, Computer Vision, and Computer Graphics Tensor Methods for Machine Learning, Computer Vision, and Computer Graphics Part I: Factorizations and Statistical Modeling/Inference Amnon Shashua School of Computer Science & Eng. The Hebrew University

More information

Statistical Machine Learning from Data

Statistical Machine Learning from Data Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Gaussian Mixture Models Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique

More information

The Exponential Family

The Exponential Family The Exponential Family David M. Blei Columbia University November 3, 2015 Definition A probability density in the exponential family has this form where p.x j / D h.x/ expf > t.x/ a./g; (1) is the natural

More information

Combining Data Mining and Neural Network Algorithm to Provide Prediction Framework for Gold Price Work on DDBS or Big Data Resources

Combining Data Mining and Neural Network Algorithm to Provide Prediction Framework for Gold Price Work on DDBS or Big Data Resources UNIVERSITY OF SCIENCE AND TECHNOLOGY COLLEGE OF GRADUATE STUDIES AND ACADEMIC ADVANCEMENT Faculty of Computer Science and Information Technology Combining Data Mining and Neural Network Algorithm to Provide

More information

Big learning: challenges and opportunities

Big learning: challenges and opportunities Big learning: challenges and opportunities Francis Bach SIERRA Project-team, INRIA - Ecole Normale Supérieure December 2013 Omnipresent digital media Scientific context Big data Multimedia, sensors, indicators,

More information

BIG DATA PROBLEMS AND LARGE-SCALE OPTIMIZATION: A DISTRIBUTED ALGORITHM FOR MATRIX FACTORIZATION

BIG DATA PROBLEMS AND LARGE-SCALE OPTIMIZATION: A DISTRIBUTED ALGORITHM FOR MATRIX FACTORIZATION BIG DATA PROBLEMS AND LARGE-SCALE OPTIMIZATION: A DISTRIBUTED ALGORITHM FOR MATRIX FACTORIZATION Ş. İlker Birbil Sabancı University Ali Taylan Cemgil 1, Hazal Koptagel 1, Figen Öztoprak 2, Umut Şimşekli

More information

Machine Learning for Data Science (CS4786) Lecture 1

Machine Learning for Data Science (CS4786) Lecture 1 Machine Learning for Data Science (CS4786) Lecture 1 Tu-Th 10:10 to 11:25 AM Hollister B14 Instructors : Lillian Lee and Karthik Sridharan ROUGH DETAILS ABOUT THE COURSE Diagnostic assignment 0 is out:

More information

15.00 15.30 30 XML enabled databases. Non relational databases. Guido Rotondi

15.00 15.30 30 XML enabled databases. Non relational databases. Guido Rotondi Programme of the ESTP training course on BIG DATA EFFECTIVE PROCESSING AND ANALYSIS OF VERY LARGE AND UNSTRUCTURED DATA FOR OFFICIAL STATISTICS Rome, 5 9 May 2014 Istat Piazza Indipendenza 4, Room Vanoni

More information

Navigating the Local Modes of Big Data: The Case of. Topic Models

Navigating the Local Modes of Big Data: The Case of. Topic Models Navigating the Local Modes of Big Data: The Case of Topic Models Margaret E. Roberts, Brandon M. Stewart, and Dustin Tingley This draft: June 28, 2015 Prepared for Computational Social Science: Discovery

More information

Mining Topics in Documents Standing on the Shoulders of Big Data. Zhiyuan (Brett) Chen and Bing Liu

Mining Topics in Documents Standing on the Shoulders of Big Data. Zhiyuan (Brett) Chen and Bing Liu Mining Topics in Documents Standing on the Shoulders of Big Data Zhiyuan (Brett) Chen and Bing Liu Topic Models Widely used in many applications Most of them are unsupervised However, topic models Require

More information

Bayesian networks - Time-series models - Apache Spark & Scala

Bayesian networks - Time-series models - Apache Spark & Scala Bayesian networks - Time-series models - Apache Spark & Scala Dr John Sandiford, CTO Bayes Server Data Science London Meetup - November 2014 1 Contents Introduction Bayesian networks Latent variables Anomaly

More information

Variational Inference in Non-negative Factorial Hidden Markov Models for Efficient Audio Source Separation

Variational Inference in Non-negative Factorial Hidden Markov Models for Efficient Audio Source Separation Variational Inference in Non-negative Factorial Hidden Markov Models for Efficient Audio Source Separation Gautham J. Mysore gmysore@adobe.com Advanced Technology Labs, Adobe Systems Inc., San Francisco,

More information

Comparative Analysis of EM Clustering Algorithm and Density Based Clustering Algorithm Using WEKA tool.

Comparative Analysis of EM Clustering Algorithm and Density Based Clustering Algorithm Using WEKA tool. International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 9, Issue 8 (January 2014), PP. 19-24 Comparative Analysis of EM Clustering Algorithm

More information

On Smoothing and Inference for Topic Models

On Smoothing and Inference for Topic Models On Smoothing and Inference for Topic Models Arthur Asuncion, Max Welling, Padhraic Smyth Department of Computer Science University of California, Irvine Irvine, CA, USA {asuncion,welling,smyth}@ics.uci.edu

More information

Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup

Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup Network Anomaly Detection A Machine Learning Perspective Dhruba Kumar Bhattacharyya Jugal Kumar KaKta»C) CRC Press J Taylor & Francis Croup Boca Raton London New York CRC Press is an imprint of the Taylor

More information

Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data

Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data (Oxford) in collaboration with: Minjie Xu, Jun Zhu, Bo Zhang (Tsinghua) Balaji Lakshminarayanan (Gatsby) Bayesian

More information

Probabilistic topic models for sentiment analysis on the Web

Probabilistic topic models for sentiment analysis on the Web University of Exeter Department of Computer Science Probabilistic topic models for sentiment analysis on the Web Chenghua Lin September 2011 Submitted by Chenghua Lin, to the the University of Exeter as

More information

SMTP: Stedelijk Museum Text Mining Project

SMTP: Stedelijk Museum Text Mining Project SMTP: Stedelijk Museum Text Mining Project Jeroen Smeets Maastricht University smeetsjeroen@hotmail.com Prof. Dr. Ir. Johannes C. Scholtes Maastricht University j.scholtes@maastrichtuniversity.nl Dr. Claartje

More information

Data Mining and Machine Learning in Bioinformatics

Data Mining and Machine Learning in Bioinformatics Data Mining and Machine Learning in Bioinformatics PRINCIPAL METHODS AND SUCCESSFUL APPLICATIONS Ruben Armañanzas http://mason.gmu.edu/~rarmanan Adapted from Iñaki Inza slides http://www.sc.ehu.es/isg

More information

Machine Learning Logistic Regression

Machine Learning Logistic Regression Machine Learning Logistic Regression Jeff Howbert Introduction to Machine Learning Winter 2012 1 Logistic regression Name is somewhat misleading. Really a technique for classification, not regression.

More information

Learning to Read Between the Lines: The Aspect Bernoulli Model

Learning to Read Between the Lines: The Aspect Bernoulli Model Learning to Read Between the Lines: The Aspect Bernoulli Model A. Kabán E. Bingham T. Hirsimäki Abstract We present a novel probabilistic multiple cause model for binary observations. In contrast to other

More information

Big graphs: Theory and Practice, January 6-8, 2016, UC San Diego. Abstracts

Big graphs: Theory and Practice, January 6-8, 2016, UC San Diego. Abstracts Big graphs: Theory and Practice, January 6-8, 2016, UC San Diego Anima Anandkumar (UC Irvine) Abstracts Learning mixed membership community models via spectral methods Abstract: Learning hidden communities

More information

Lecture 11: Graphical Models for Inference

Lecture 11: Graphical Models for Inference Lecture 11: Graphical Models for Inference So far we have seen two graphical models that are used for inference - the Bayesian network and the Join tree. These two both represent the same joint probability

More information

MapReduce Approach to Collective Classification for Networks

MapReduce Approach to Collective Classification for Networks MapReduce Approach to Collective Classification for Networks Wojciech Indyk 1, Tomasz Kajdanowicz 1, Przemyslaw Kazienko 1, and Slawomir Plamowski 1 Wroclaw University of Technology, Wroclaw, Poland Faculty

More information

Learning to Suggest Questions in Online Forums

Learning to Suggest Questions in Online Forums Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence Learning to Suggest Questions in Online Forums Tom Chao Zhou 1, Chin-Yew Lin 2,IrwinKing 3, Michael R. Lyu 1, Young-In Song 2

More information

Research Methods Courses

Research Methods Courses Research Methods Courses ACCTG 501 ADTED 550 ADTED 551 A ED 502 AEE 520 AEE 521 AEREC 510 AEREC 511 APLNG 578 APLNG 581 BB H 505 Research Methods in Accounting Qualitative Research in Adult Ed (Introduction

More information

Topic Modeling using Topics from Many Domains, Lifelong Learning and Big Data

Topic Modeling using Topics from Many Domains, Lifelong Learning and Big Data Topic Modeling using Topics from Many Domains, Lifelong Learning and Big Data Zhiyuan Chen Department of Computer Science, University of Illinois at Chicago Bing Liu Department of Computer Science, University

More information

ISSUES IN RULE BASED KNOWLEDGE DISCOVERING PROCESS

ISSUES IN RULE BASED KNOWLEDGE DISCOVERING PROCESS Advances and Applications in Statistical Sciences Proceedings of The IV Meeting on Dynamics of Social and Economic Systems Volume 2, Issue 2, 2010, Pages 303-314 2010 Mili Publications ISSUES IN RULE BASED

More information

The Data Mining Process

The Data Mining Process Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data

More information

PREA: Personalized Recommendation Algorithms Toolkit

PREA: Personalized Recommendation Algorithms Toolkit Journal of Machine Learning Research 13 (2012) 2699-2703 Submitted 7/11; Revised 4/12; Published 9/12 PREA: Personalized Recommendation Algorithms Toolkit Joonseok Lee Mingxuan Sun Guy Lebanon College

More information

Dirichlet Processes A gentle tutorial

Dirichlet Processes A gentle tutorial Dirichlet Processes A gentle tutorial SELECT Lab Meeting October 14, 2008 Khalid El-Arini Motivation We are given a data set, and are told that it was generated from a mixture of Gaussian distributions.

More information

A Variational Approximation for Topic Modeling of Hierarchical Corpora

A Variational Approximation for Topic Modeling of Hierarchical Corpora A Variational Approximation for Topic Modeling of Hierarchical Corpora Do-kyum Kim dok027@cs.ucsd.edu Geoffrey M. Voelker voelker@cs.ucsd.edu Lawrence K. Saul saul@cs.ucsd.edu Department of Computer Science

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION 1 CHAPTER 1 INTRODUCTION Exploration is a process of discovery. In the database exploration process, an analyst executes a sequence of transformations over a collection of data structures to discover useful

More information

Topic models for Sentiment analysis: A Literature Survey

Topic models for Sentiment analysis: A Literature Survey Topic models for Sentiment analysis: A Literature Survey Nikhilkumar Jadhav 123050033 June 26, 2014 In this report, we present the work done so far in the field of sentiment analysis using topic models.

More information

Analyzing Huge Data Sets in Forensic Investigations

Analyzing Huge Data Sets in Forensic Investigations Analyzing Huge Data Sets in Forensic Investigations Kasun De Zoysa Yasantha Hettiarachi Department of Communication and Media Technologies University of Colombo School of Computing Colombo, Sri Lanka Centre

More information

Statistics Graduate Courses

Statistics Graduate Courses Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.

More information

Data Clustering. Dec 2nd, 2013 Kyrylo Bessonov

Data Clustering. Dec 2nd, 2013 Kyrylo Bessonov Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main

More information

A Statistical Text Mining Method for Patent Analysis

A Statistical Text Mining Method for Patent Analysis A Statistical Text Mining Method for Patent Analysis Department of Statistics Cheongju University, shjun@cju.ac.kr Abstract Most text data from diverse document databases are unsuitable for analytical

More information

A Stock Trading Algorithm Model Proposal, based on Technical Indicators Signals

A Stock Trading Algorithm Model Proposal, based on Technical Indicators Signals Informatica Economică vol. 15, no. 1/2011 183 A Stock Trading Algorithm Model Proposal, based on Technical Indicators Signals Darie MOLDOVAN, Mircea MOCA, Ştefan NIŢCHI Business Information Systems Dept.

More information

Similarity Search and Mining in Uncertain Spatial and Spatio Temporal Databases. Andreas Züfle

Similarity Search and Mining in Uncertain Spatial and Spatio Temporal Databases. Andreas Züfle Similarity Search and Mining in Uncertain Spatial and Spatio Temporal Databases Andreas Züfle Geo Spatial Data Huge flood of geo spatial data Modern technology New user mentality Great research potential

More information

Journal of Chemical and Pharmaceutical Research, 2015, 7(3):1388-1392. Research Article. E-commerce recommendation system on cloud computing

Journal of Chemical and Pharmaceutical Research, 2015, 7(3):1388-1392. Research Article. E-commerce recommendation system on cloud computing Available online www.jocpr.com Journal of Chemical and Pharmaceutical Research, 2015, 7(3):1388-1392 Research Article ISSN : 0975-7384 CODEN(USA) : JCPRC5 E-commerce recommendation system on cloud computing

More information

Chapter 2 The Research on Fault Diagnosis of Building Electrical System Based on RBF Neural Network

Chapter 2 The Research on Fault Diagnosis of Building Electrical System Based on RBF Neural Network Chapter 2 The Research on Fault Diagnosis of Building Electrical System Based on RBF Neural Network Qian Wu, Yahui Wang, Long Zhang and Li Shen Abstract Building electrical system fault diagnosis is the

More information

Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing

Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing Introduction to Data Mining and Machine Learning Techniques Iza Moise, Evangelos Pournaras, Dirk Helbing Iza Moise, Evangelos Pournaras, Dirk Helbing 1 Overview Main principles of data mining Definition

More information

Statistical Machine Translation: IBM Models 1 and 2

Statistical Machine Translation: IBM Models 1 and 2 Statistical Machine Translation: IBM Models 1 and 2 Michael Collins 1 Introduction The next few lectures of the course will be focused on machine translation, and in particular on statistical machine translation

More information

Big Data from a Database Theory Perspective

Big Data from a Database Theory Perspective Big Data from a Database Theory Perspective Martin Grohe Lehrstuhl Informatik 7 - Logic and the Theory of Discrete Systems A CS View on Data Science Applications Data System Users 2 Us Data HUGE heterogeneous

More information

SPATIAL DATA CLASSIFICATION AND DATA MINING

SPATIAL DATA CLASSIFICATION AND DATA MINING , pp.-40-44. Available online at http://www. bioinfo. in/contents. php?id=42 SPATIAL DATA CLASSIFICATION AND DATA MINING RATHI J.B. * AND PATIL A.D. Department of Computer Science & Engineering, Jawaharlal

More information

High Productivity Data Processing Analytics Methods with Applications

High Productivity Data Processing Analytics Methods with Applications High Productivity Data Processing Analytics Methods with Applications Dr. Ing. Morris Riedel et al. Adjunct Associate Professor School of Engineering and Natural Sciences, University of Iceland Research

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

More information

Detecting client-side e-banking fraud using a heuristic model

Detecting client-side e-banking fraud using a heuristic model Detecting client-side e-banking fraud using a heuristic model Tim Timmermans tim.timmermans@os3.nl Jurgen Kloosterman jurgen.kloosterman@os3.nl University of Amsterdam July 4, 2013 Tim Timmermans, Jurgen

More information

Bayesian Predictive Profiles with Applications to Retail Transaction Data

Bayesian Predictive Profiles with Applications to Retail Transaction Data Bayesian Predictive Profiles with Applications to Retail Transaction Data Igor V. Cadez Information and Computer Science University of California Irvine, CA 92697-3425, U.S.A. icadez@ics.uci.edu Padhraic

More information

A crash course in probability and Naïve Bayes classification

A crash course in probability and Naïve Bayes classification Probability theory A crash course in probability and Naïve Bayes classification Chapter 9 Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s

More information

Interoperability, Standards and Open Advancement

Eric Nyberg 1 Open Shared resources & annotation schemas Shared component APIs Shared datasets (corpora, test sets) Shared software (open source) Shared configurations

Index Terms Cloud Storage Services, data integrity, dependable distributed storage, data dynamics, Cloud Computing.

Volume 3, Issue 5, May 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Privacy - Preserving

Bayesian Factorization Machines

Christoph Freudenthaler, Lars Schmidt-Thieme Information Systems & Machine Learning Lab University of Hildesheim 31141 Hildesheim {freudenthaler, schmidt-thieme}@ismll.de

Robotics 2 Clustering & EM. Giorgio Grisetti, Cyrill Stachniss, Kai Arras, Maren Bennewitz, Wolfram Burgard

Clustering (1) Common technique for statistical data analysis to detect structure (machine learning,

Massive Cloud Auditing using Data Mining on Hadoop

Prof. Sachin Shetty CyberBAT Team, AFRL/RIGD AFRL VFRP Tennessee State University Outline Massive Cloud Auditing Traffic Characterization Distributed

Identifying Focus, Techniques and Domain of Scientific Papers

Sonal Gupta Department of Computer Science Stanford University Stanford, CA 94305 sonal@cs.stanford.edu Christopher D. Manning Department of

Survey on Hybrid Approach for Fraud Detection in Health Insurance

Punam Devidas Bagul, Sachin Bojewar, Ankit Sanghavi M. E Student, Dept. of Computer Science and Engineering, ARMIET, Sapgaon, Mumbai University,

TRENDS IN AN INTERNATIONAL INDUSTRIAL ENGINEERING RESEARCH JOURNAL: A TEXTUAL INFORMATION ANALYSIS PERSPECTIVE. wernervanzyl@sun.ac.

J.W. Uys 1 *, C.S.L. Schutte 2 and W.D. Van Zyl 3 1 Indutech (Pty) Ltd., Stellenbosch, South

High Performance Matrix Inversion with Several GPUs

High Performance Matrix Inversion on a Multi-core Platform with Several GPUs Pablo Ezzatti 1, Enrique S. Quintana-Ortí 2 and Alfredo Remón 2 1 Centro de Cálculo-Instituto de Computación, Univ. de la República

Stock Option Pricing Using Bayes Filters

Lin Liao liaolin@cs.washington.edu Abstract When using Black-Scholes formula to price options, the key is the estimation of the stochastic return variance. In this

Dissertation TOPIC MODELS FOR IMAGE RETRIEVAL ON LARGE-SCALE DATABASES. Eva Hörster

Department of Computer Science University of Augsburg Adviser: Readers: Prof. Dr. Rainer Lienhart Prof. Dr. Rainer Lienhart

The Basics of Graphical Models

David M. Blei Columbia University October 3, 2015 Introduction These notes follow Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan. Many figures

Distributed forests for MapReduce-based machine learning

Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication

Text Analytics. A business guide

February 2014 Contents 3 The Business Value of Text Analytics 4 What is Text Analytics? 6 Text Analytics Methods 8 Unstructured Meets Structured Data 9 Business Application

A Capability Model for Business Analytics: Part 2 Assessing Analytic Capabilities

The first article of this series presented the capability model for business analytics that is illustrated in Figure One.

A Bayesian Topic Model for Spam Filtering

Journal of Information & Computational Science 1:12 (213) 3719 3727 August 1, 213 Available at http://www.joics.com Zhiying Zhang, Xu Yu, Lixiang Shi, Li Peng,

Date: May 6 (Wednesday), 2015, 14:00 ~ 18:00 Venue: Room No. 201, Engineering Building 2, Yonsei University, Seoul, Korea

Microsoft Research Yonsei University Joint Workshop PROGRAM Time 14:00 ~ 14:10

MACHINE LEARNING BASICS WITH R

MACHINE LEARNING [Hands-on Introduction of Supervised Machine Learning Methods] DURATION 2 DAY The field of machine learning is concerned with the question of how to construct computer programs that automatically

Interoperability and Analytics February 29, 2016

Matthew Hoffman MD, CMIO Utah Health Information Network Conflict of Interest Matthew Hoffman, MD Has no real or apparent conflicts of interest to report.

Online Optimization and Personalization of Teaching Sequences

Benjamin Clément 1, Didier Roy 1, Manuel Lopes 1, Pierre-Yves Oudeyer 1 1 Flowers Lab Inria Bordeaux Sud-Ouest, Bordeaux 33400, France, didier.roy@inria.fr

An Introduction to Data Mining

Intel Beijing wei.heng@intel.com January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail

Tensor Factorization for Multi-Relational Learning

Maximilian Nickel 1 and Volker Tresp 2 1 Ludwig Maximilian University, Oettingenstr. 67, Munich, Germany nickel@dbs.ifi.lmu.de 2 Siemens AG, Corporate

Outline. What is Big data and where they come from? How we deal with Big data?

What is Big Data. Big Data Everywhere! As a human, we generate a lot of data during our everyday activity. When you buy something,

FOUNDATIONAL SYSTEMS BRIDGING DATA MINING AND VISUAL ANALYTICS

JAEGUL CHOO RESEARCH STATEMENT My primary research goal is to develop new methods and systems that firmly unify data mining and visual analytics for solving challenging problems in big data. Data mining

Advanced analytics at your hands

2.3 Neural Designer is the most powerful predictive analytics software. It uses innovative neural networks techniques to provide data scientists with results in a way previously

Machine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer

What is learning? Learning denotes changes in a system that... enable a system to do the same task more efficiently the next

Introduction to Machine Learning. Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011

Outline 1. What is machine learning? 2. The basic of machine learning 3. Principles and effects of machine learning

Scaling Bayesian Network Parameter Learning with Expectation Maximization using MapReduce

Erik B. Reed Carnegie Mellon University Silicon Valley Campus NASA Research Park Moffett Field, CA 94035 erikreed@cmu.edu