Inference Methods for Analyzing the Hidden Semantics in Big Data Phuong LE-HONG phuonglh@gmail.com
Introduction
- Grant proposal for a basic research project, Nafosted, 2014
- 24 months
- Principal Investigator: KhoatTQ, SoICT, HUST
June 2014 Nafosted Proposal
Goal
Develop a class of inference algorithms that enable us to explore and discover hidden structures (semantics) in massive text collections, and to make accurate predictions in practical applications.
Methodologies
Key directions in Distributed Processing and Machine Learning:
- Topic modeling (Blei, 2012)
- Matrix factorization (Lee & Seung, 1999)
- Online learning (Hazan & Kale, 2012)
- Stochastic inference (Hoffman et al., 2013)
Applications
Develop efficient methods for:
- Question answering
- Text and web mining
- Recommendation systems
- Social network analysis
Literature Review
Inferring hidden structures from data is an attractive research topic with many applications:
- Exploration of a century of scientific journals (Mimno, 2012; Blei & Lafferty, 2007)
- Exploration of a century of literature (Jockers & Mimno, 2013)
- Exploration of online forums/networks (Cao et al., 2011; Gerrish & Blei, 2012; Sun & Lin, 2013)
- Analyzing political opinions from online forums (Cao et al., 2011; Gerrish & Blei, 2012; Grimmer, 2010; Levy & Franklin, 2013)
- Analyzing behaviors and interests of online users (Gerrish & Blei, 2012; Sun & Lin, 2013; Wang et al., 2011)
Literature Review
Many approaches:
- Bayesian networks (Darwiche, 2010)
- Gaussian graphical models (Hsieh et al., 2013)
- Topic modeling (Hofmann, 2001; Blei, 2012)
- Non-negative matrix factorization (NMF) (Lee & Seung, 1999; Wang et al., 2011)
This project will use topic modeling and NMF as the main tools for developing efficient methods to analyze big text collections.
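To make the NMF idea concrete, here is a minimal sketch of multiplicative updates in the style associated with Lee & Seung, minimizing the squared error between a non-negative matrix V and the product W H. It is an illustration only, not the method proposed in this project: the function names, the toy matrix, the random initialization, and the small constant eps are all assumptions chosen for the example, and the code uses plain Python lists so that it runs without extra libraries.

```python
import random


def matmul(A, B):
    """Multiply two dense matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def transpose(A):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def squared_error(V, W, H):
    """Sum of squared differences between V and the product W H."""
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx))


def nmf(V, rank, iters=200, seed=0, eps=1e-9):
    """Factor a non-negative n x m matrix V into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    # Strictly positive random start (an assumption of this sketch).
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise.
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise.
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
    return W, H


# Toy term-document style matrix with two obvious blocks (hypothetical data).
V = [[3, 0, 1, 0], [2, 0, 1, 0], [0, 4, 0, 2], [0, 3, 0, 1]]
W, H = nmf(V, rank=2)
print(squared_error(V, W, H))
```

Because every update multiplies by a ratio of non-negative quantities, W and H stay non-negative throughout, which is what makes the factors readable as additive parts or topics.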
Literature Review
- Inference for a document: estimation of the variables that are hidden in that document (topics, entities, entity relations)
- Inference for a dataset: learning of the hidden structures (topics, topical networks, social communities, user trends)
- Inference is NP-hard (Sontag & Roy, 2011)
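As an illustration of what inference for a single document can look like, one common formalization in the LDA setting is MAP estimation of the document's topic proportions with the topics held fixed. This is a sketch under stated assumptions and not necessarily the exact formulation the proposal targets. The symbols are assumptions introduced here: K topics, vocabulary size V, fixed topic-word probabilities \beta_{kj}, the count d_j of term j in the document, a symmetric Dirichlet prior with parameter \alpha, and the unit simplex \Delta_K.

```latex
\theta^{*} = \arg\max_{\theta \in \Delta_K}
  \sum_{j=1}^{V} d_j \log \Big( \sum_{k=1}^{K} \theta_k \beta_{kj} \Big)
  + (\alpha - 1) \sum_{k=1}^{K} \log \theta_k
```

The first term is concave in \theta, but when \alpha < 1 the second term is convex, so the objective as a whole is no longer concave. This is the kind of non-convex problem for which standard convex optimization offers no quality guarantee.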
Literature Review
Various methods for efficient inference have been proposed:
- Maximum likelihood estimation (ML) (Hofmann, 2001)
- Variational Bayesian (VB) (Blei et al., 2003)
- Collapsed variational Bayesian (CVB) (Asuncion et al., 2009)
- Collapsed Gibbs sampling (CGS) (Griffiths & Steyvers, 2004)
- Maximum a posteriori estimation (MAP) (Chien & Wu, 2008)
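Of the methods listed above, collapsed Gibbs sampling is perhaps the easiest to sketch in a few lines. The following is a minimal illustration for LDA, assuming symmetric Dirichlet hyperparameters alpha and beta and documents given as lists of integer word ids. The function name, defaults, and toy corpus are hypothetical, and the code is meant to show the shape of the sampler, not a tuned or scalable implementation.

```python
import random


def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; returns doc-topic and topic-word counts."""
    rng = random.Random(seed)
    ndk = [[0] * n_topics for _ in docs]               # topic counts per document
    nkw = [[0] * vocab_size for _ in range(n_topics)]  # word counts per topic
    nk = [0] * n_topics                                # total tokens per topic
    z = []                                             # topic assignment per token
    # Random initial assignment of every token to a topic.
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = rng.randrange(n_topics)
            zd.append(k)
            ndk[d][k] += 1
            nkw[k][w] += 1
            nk[k] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                ndk[d][k] -= 1
                nkw[k][w] -= 1
                nk[k] -= 1
                # Unnormalized conditional probability of each topic for this token.
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                # Resample the topic and restore the counts.
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1
                nkw[k][w] += 1
                nk[k] += 1
    return ndk, nkw


# Hypothetical toy corpus over a 5-word vocabulary.
docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4], [0, 1, 3, 4, 2, 2]]
ndk, nkw = lda_gibbs(docs, n_topics=2, vocab_size=5, iters=50)
print(ndk)
```

Each sweep touches every token once, so the cost per iteration grows linearly with corpus size and number of topics. That cost is one reason sampling-based inference becomes expensive on very large collections.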
Literature Review
Some remarks:
- Sampling-based methods are guaranteed to converge to the underlying distributions, but at an unknown rate.
- VB and CVB are much faster.
- CVB0 (Asuncion et al., 2009) often performs best.
Literature Review
After over 20 years of development, many open problems remain.
- Accuracy of inferring a model from data:
  - Attacked by (Arora et al., 2012; Arora et al., 2013; Anandkumar et al., 2012), with breakthrough results.
  - But those results are limited to some restricted models under certain conditions.
  - A large class of topic models and NMF still lacks a theoretical guarantee.
  - And those results do not cover inference for individual documents.
Literature Review
Previous work on processing big data collections:
- Focuses mainly on utilizing parallel/distributed architectures
- Works well with millions of documents
- Two main limitations:
  - LDA models are dense, which might consume huge amounts of memory when the domain dimension is very large.
  - Existing methods for inferring individual documents have no theoretical guarantee on either inference quality or inference time.
Five Problems
- P1: Can we develop a fast inference method that has provable theoretical guarantees on quality?
- P2: How can we learn a big topic model from big data?
- P3: Can we develop methods with provable guarantees on quality for handling streaming/dynamic text collections?
Five Problems
- P4: Can we develop an optimized big data processing framework to handle massive distributed computations of inference methods?
- P5: How can the hidden semantics recovered by our inference methods be useful in fundamental problems of NLP and IR?
  - QA
  - Text and web mining
  - Recommendation
Three Groups
- Inference methods: TQ. Khoat, NK. Anh, NV. Linh (P1, P2, P3)
- Large-scale computation: TV. Trung, NB. Minh, TQ. Khoat (P3, P4)
- Applications: LH. Phuong, NV. Linh, NK. Anh, TQ. Khoat (P1, P5)
Expected Results
- A fast inference method that has a theoretical guarantee on quality and is general enough to be easily employed in a large class of statistical models
- A family of methods for analyzing the hidden structures/semantics in text collections and non-negative data
- A provably fast method that enables us to work with streaming/dynamic text collections and non-negative data
Expected Results
- A new theory that enables us to design fast algorithms for non-convex inference problems, which appear in a large class of probabilistic models
- New effective methods for practical applications such as question answering, text & web mining, recommendation, and social network analysis
Expected Results
- Publications:
  - Articles in ISI-covered journals: 2
  - National/International conferences: 5
- Training results:
  - Masters: 2
  - PhD: 3