Inference Methods for Analyzing the Hidden Semantics in Big Data. Phuong LE-HONG phuonglh@gmail.com





Introduction
- Grant proposal for a basic research project
- Nafosted, 2014; duration: 24 months
- Principal Investigator: KhoatTQ, SoICT, HUST
June 2014, Nafosted Proposal

Goal
Develop a class of inference algorithms that enable us to:
- explore and discover hidden structures (semantics) in massive text collections;
- make accurate predictions in practical applications.

Methodologies
Key directions in Distributed Processing and Machine Learning:
- Topic modeling (Blei, 2012)
- Matrix factorization (Lee & Seung, 1999)
- Online learning (Hazan & Kale, 2012)
- Stochastic inference (Hoffman et al., 2013)

Applications
Develop efficient methods for:
- Question answering
- Text and web mining
- Recommendation systems
- Social network analysis

Literature Review
Inferring hidden structures from data is an attractive research topic with many applications:
- Exploration of a century of scientific journals (Mimno, 2012; Blei & Lafferty, 2007)
- Exploration of a century of literature (Jockers & Mimno, 2013)
- Exploration of online forums/networks (Cao et al., 2011; Gerrish & Blei, 2012; Sun & Lin, 2013)
- Analyzing political opinions from online forums (Cao et al., 2011; Gerrish & Blei, 2012; Grimmer, 2010; Levy & Franklin, 2013)
- Analyzing behaviors and interests of online users (Gerrish & Blei, 2012; Sun & Lin, 2013; Wang et al., 2011)

Literature Review
Many approaches:
- Bayesian networks (Darwiche, 2010)
- Gaussian graphical models (Hsieh et al., 2013)
- Topic modeling (Hofmann, 2001; Blei, 2012)
- Non-negative matrix factorization (NMF) (Lee & Seung, 1999; Wang et al., 2011)
This project will use topic modeling and NMF as the main ways to develop efficient methods for analyzing big text collections.
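To make the NMF approach concrete, here is a minimal sketch of the multiplicative update rules commonly attributed to Lee & Seung for the squared-error objective. It is an illustration only, not the project's method: the function names, the toy document-term matrix V, the rank, and the small constant eps (added to avoid division by zero) are all assumptions.

```python
import random

def matmul(A, B):
    """Plain-Python product of two list-of-lists matrices."""
    cols_B = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols_B] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frobenius_error(V, W, H):
    """Squared Frobenius norm of V - WH."""
    WH = matmul(W, H)
    return sum((v - wh) ** 2 for rv, rwh in zip(V, WH) for v, wh in zip(rv, rwh))

def nmf(V, rank, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (n x m) as W (n x rank) times H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    # Strictly positive random initialization (an arbitrary choice).
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    errors = [frobenius_error(V, W, H)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
        errors.append(frobenius_error(V, W, H))
    return W, H, errors

# Hypothetical 4-document x 4-term count matrix with two visible word groups.
V = [[3, 2, 0, 0],
     [2, 3, 0, 0],
     [0, 0, 2, 3],
     [0, 0, 3, 2]]
W, H, errors = nmf(V, rank=2)
print("initial error:", errors[0], "final error:", errors[-1])
```

Because every update multiplies by a non-negative ratio, W and H stay non-negative throughout, which is what lets the rows of H be read as topic-like word groups.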

Literature Review
- Inference for a document: estimation of the variables that are hidden in that document (topics, entities, entity relations)
- Inference for a dataset: learning of the hidden structures (topics, topical networks, social communities, user trends)
- Inference is NP-hard (Sontag & Roy, 2011)
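As an illustration of what inference for a single document means, the sketch below estimates a document's topic proportions when the topics themselves are held fixed. It uses plain EM on the maximum-likelihood objective (PLSA-style "folding-in"), which is a simpler variant chosen for illustration and not a method from the proposal. The function name, the two toy topics, and the word counts are hypothetical.

```python
import math

def infer_topic_proportions(word_counts, topics, n_iter=50):
    """EM for theta maximizing sum_w n_w * log(sum_k theta[k] * topics[k][w]).

    word_counts[w] is the count of word w in the document;
    topics[k][w] is the (fixed) probability of word w under topic k.
    """
    n_topics = len(topics)
    total = sum(word_counts)
    theta = [1.0 / n_topics] * n_topics  # start from uniform proportions
    log_liks = []
    for _ in range(n_iter):
        new_theta = [0.0] * n_topics
        log_lik = 0.0
        for w, n_w in enumerate(word_counts):
            if n_w == 0:
                continue
            # E-step: responsibility of each topic for word w.
            joint = [theta[k] * topics[k][w] for k in range(n_topics)]
            p_w = sum(joint)
            log_lik += n_w * math.log(p_w)
            for k in range(n_topics):
                new_theta[k] += n_w * joint[k] / p_w
        # M-step: normalize expected counts into proportions.
        theta = [t / total for t in new_theta]
        log_liks.append(log_lik)
    return theta, log_liks

# Two hypothetical topics over a 4-word vocabulary.
topics = [[0.45, 0.45, 0.05, 0.05],
          [0.05, 0.05, 0.45, 0.45]]
word_counts = [5, 4, 1, 0]  # a document dominated by words 0 and 1
theta, log_liks = infer_topic_proportions(word_counts, topics)
print("topic proportions:", theta)
```

The log-likelihood recorded at each iteration never decreases, which is the standard EM guarantee; it says nothing about how fast the estimate converges.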

Literature Review
Various methods for efficient inference have been proposed:
- Maximum likelihood estimation (ML) (Hofmann, 2001)
- Variational Bayesian (VB) (Blei et al., 2003)
- Collapsed variational Bayesian (CVB) (Asuncion et al., 2009)
- Collapsed Gibbs sampling (CGS) (Griffiths & Steyvers, 2004)
- Maximum a posteriori estimation (MAP) (Chien & Wu, 2008)
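Of the methods listed, collapsed Gibbs sampling is the shortest to write down. The following is a minimal sketch for LDA, assuming symmetric Dirichlet hyperparameters alpha and beta; the function name, the hyperparameter values, and the toy corpus (documents as lists of word ids) are assumptions made for illustration.

```python
import random

def lda_collapsed_gibbs(docs, n_topics, vocab_size,
                        alpha=0.1, beta=0.01, n_sweeps=200, seed=0):
    """Collapsed Gibbs sampling for LDA on documents given as lists of word ids."""
    rng = random.Random(seed)
    n_dk = [[0] * n_topics for _ in docs]               # document-topic counts
    n_kw = [[0] * vocab_size for _ in range(n_topics)]  # topic-word counts
    n_k = [0] * n_topics                                # tokens per topic
    z = []                                              # topic of every token
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)
    for _ in range(n_sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Conditional distribution over topics, up to a constant.
                weights = [(n_dk[d][t] + alpha) * (n_kw[t][w] + beta)
                           / (n_k[t] + vocab_size * beta)
                           for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    # Point estimate of per-document topic proportions from the final state.
    theta = [[(c + alpha) / (len(doc) + n_topics * alpha) for c in row]
             for row, doc in zip(n_dk, docs)]
    return theta, n_kw

# Hypothetical toy corpus over a 4-word vocabulary.
docs = [[0, 1, 0, 1, 0], [2, 3, 2, 3, 3], [0, 1, 2, 3, 1]]
theta, n_kw = lda_collapsed_gibbs(docs, n_topics=2, vocab_size=4)
for d, row in enumerate(theta):
    print("document", d, "topic proportions:", row)
```

Each sweep touches every token once, so the cost per sweep is proportional to the number of tokens times the number of topics, which is one reason the faster variational methods are attractive for large collections.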

Literature Review
Some remarks:
- Sampling-based methods are guaranteed to converge to the underlying distributions, but at an unknown rate.
- VB and CVB are much faster.
- CVB0 (Asuncion et al., 2009) often performs the best.

Literature Review
After more than 20 years of development, many open problems remain.
Accuracy of inferring a model from data:
- Attacked by (Arora et al., 2012; Arora et al., 2013; Anandkumar et al., 2012), with breakthrough results;
- But those results are limited to some restricted models under certain conditions.
- A large class of topic models and NMF still lacks a theoretical guarantee.
- And those results do not cover inference for individual documents.

Literature Review
Previous works on processing big data collections:
- Focus mainly on utilizing parallel/distributed architectures
- Work well with million-document collections
Two main limitations:
- LDA models are dense, which might consume huge amounts of memory when the domain dimension is very large;
- Existing methods for inferring individual documents do not have any theoretical guarantee on either inference quality or inference time.
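The memory limitation can be made concrete with a back-of-the-envelope calculation: a dense topic model stores one value per (topic, word) pair. The helper name and the sizes below (1,000 topics, a one-million-word vocabulary, 8-byte floats) are illustrative assumptions, not figures from the proposal.

```python
def dense_topic_model_bytes(n_topics, vocab_size, bytes_per_value=8):
    """Memory needed to store a dense topic-word matrix, in bytes."""
    return n_topics * vocab_size * bytes_per_value

# Hypothetical sizes: 1,000 topics over a 1,000,000-word vocabulary.
size = dense_topic_model_bytes(1_000, 1_000_000)
print(size / 1e9, "GB")  # 8.0 GB
```

Under these assumptions the topic-word matrix alone would take 8 GB, before any per-document statistics are stored.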

Five Problems
- P1: Can we develop a fast inference method that has provable theoretical guarantees on quality?
- P2: How can we learn a big topic model from big data?
- P3: Can we develop methods with provable guarantees on quality for handling streaming/dynamic text collections?

Five Problems
- P4: Can we develop an optimized big data processing framework to handle massive distributed computations of inference methods?
- P5: How can the hidden semantics recovered by our inference methods be useful in fundamental problems of NLP and IR?
  - Question answering (QA)
  - Text and web mining
  - Recommendation

Three Groups
- Inference methods (P1, P2, P3): TQ. Khoat, NK. Anh, NV. Linh
- Large-scale computation (P3, P4): TV. Trung, NB. Minh, TQ. Khoat
- Applications (P1, P5): LH. Phuong, NV. Linh, NK. Anh, TQ. Khoat

Expected Results
- A fast inference method that has a theoretical guarantee on quality and is general enough to be easily employed in a large class of statistical models
- A family of methods for analyzing the hidden structures/semantics in text collections and non-negative data
- A provably fast method that enables us to work with streaming/dynamic text collections and non-negative data

Expected Results
- A new theory that enables us to design fast algorithms for non-convex inference problems, which appear in a large class of probabilistic models
- New effective methods for practical applications such as question answering, text and web mining, recommendation, and social network analysis

Expected Results
Publications:
- Articles in ISI-covered journals: 2
- National/International conferences: 5
Training results:
- Masters: 2
- PhD: 3