High Performance Latent Dirichlet Allocation for Text Mining

A thesis submitted for the Degree of Doctor of Philosophy

By

Department of Electronic and Computer Engineering
School of Engineering and Design
Brunel University

September 2013
Abstract

Latent Dirichlet Allocation (LDA), a total probability generative model, is a three-tier Bayesian model. LDA computes the latent topic structure of the data and obtains the significant information of documents. However, traditional LDA has several limitations in practical applications. LDA cannot be directly used in classification because it is an unsupervised learning model; it needs to be embedded into appropriate classification algorithms. As a generative model, LDA normally generates latent topics in the categories to which the target documents do not belong, producing deviation in computation and reducing classification accuracy. The number of topics in LDA greatly influences the learning process of the model parameters. Noise samples in the training data also affect the final text classification result, and the quality of LDA-based classifiers depends to a great extent on the quality of the training samples. Although parallel LDA algorithms have been proposed to deal with huge amounts of data, balancing computing loads in a computer cluster poses another challenge.

This thesis presents a text classification method which combines the LDA model and the Support Vector Machine (SVM) classification algorithm for improved classification accuracy while reducing the dimension of datasets. Based on Density-Based Spatial Clustering of Applications with Noise (DBSCAN), the algorithm automatically optimizes the number of topics to be selected, which reduces the number of iterations in computation. Furthermore, this thesis presents a noise data reduction scheme to process noisy data. Even when the noise ratio is large in the training data set, the noise reduction scheme can still produce a high level of classification accuracy. Finally, the thesis parallelizes LDA using the MapReduce model, which is the de facto computing standard for supporting data-intensive applications. A genetic algorithm based load balancing algorithm is designed to balance the workloads among computers in a heterogeneous MapReduce cluster where the computers have a variety of computing resources in terms of CPU speed, memory space and hard disk space.
Acknowledgement

I would like to thank the many people who have helped me. First of all, I would like to thank my PhD supervisor, Prof. Maozhen Li. Throughout the whole research he always gave me the greatest help and guidance from end to end. In addition, his advice and support greatly encouraged me so that I could face all the difficulties. I not only learned more about my research, but also learned how to analyze and solve problems. I also thank Xiaoyu Chen, Yang Lu and Yu Zhao, who gave me many significant opinions, and especially for their support and care in everyday life. Moreover, I would like to thank the School of Engineering and Design and Brunel University; during my PhD research years I received help in all aspects from the School and the University. Finally, I thank my parents, girlfriend and housemates, who gave me the greatest support and courage when I was in the most difficult times. Here, I would like to express my heartfelt thanks to all the friends who have helped me. I shall remember the days I spent at Brunel forever.
Author's Declaration

The work described in this thesis has not been previously submitted for a degree in this or any other university, and unless otherwise referenced it is the author's own work.
Statement of Copyright

The copyright of this thesis rests with the author. No quotation from it should be published without his prior written consent, and information derived from it should be acknowledged.
List of Abbreviations

AD-LDA     Approximate Distributed LDA
API        Application Program Interface
BP         Belief Propagation
CTM        Correlated Topic Model
DAG        Directed Acyclic Graph
DBSCAN     Density-Based Spatial Clustering of Applications with Noise
DLB        Dynamic Load Balancing
DP         Dirichlet Process
EM         Expectation-Maximization
FIFO       First in First out
GA         Genetic Algorithm
GFS        Google File System
GS         Gibbs Sampling
HDFS       Hadoop Distributed File System
HD-LDA     Hierarchical Distributed LDA
HDP        Hierarchical Dirichlet Process
ICE        Internet Communications Engine
Intel TBB  Intel Threading Building Blocks
JDK        Java Development Kit
KLD        Kullback-Leibler Divergence
K-NN       K-Nearest Neighbor
LDA        Latent Dirichlet Allocation
LSI        Latent Semantic Indexing
MCMC       Markov Chain Monte Carlo
MPI        Message Passing Interface
PAM        Pachinko Allocation Model
PLDA       Parallel Latent Dirichlet Allocation
PLSI       Probabilistic Latent Semantic Indexing
SGA        Simple Genetic Algorithm
SSH        Secure Shell
SVD        Singular Value Decomposition
SVM        Support Vector Machines
TF-IDF     Term Frequency-Inverse Document Frequency
VB         Variational Bayes
VI         Variational Inference
VSM        Vector Space Model
Table of Contents

Abstract
Acknowledgement
Author's Declaration
Statement of Copyright
List of Abbreviations
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction ... 1
1.1 Background ... 1
1.1.1 Text Mining Techniques ... 2
1.1.2 High Performance Computing for Text Mining ... 4
1.2 Motivation of Work ... 7
1.3 Major Contributions ... 11
1.4 Structure of the Thesis ... 14
Chapter 2 Literature Review ... 16
2.1 Introduction ... 16
2.2 Probability Topic Models ... 16
2.2.1 TF-IDF Model ... 18
2.2.2 Mixture of Unigrams ... 19
2.2.3 LSI Model ... 20
2.2.3.1 Basic Concepts ... 20
2.2.3.2 The Modeling Process ... 21
2.2.3.3 The Advantages of LSI ... 24
2.2.3.4 The Disadvantages of LSI ... 24
2.2.4 PLSI Model ... 25
2.2.4.1 Basic Concepts ... 25
2.2.4.2 The Modeling Process ... 26
2.2.4.3 The Advantages of PLSI ... 29
2.2.4.4 The Disadvantages of PLSI ... 30
2.2.5 LDA Model ... 30
2.2.5.1 Basic Concepts ... 30
2.2.5.2 The Modeling Process ... 32
2.2.5.3 The Advantages of LDA ... 37
2.3 An Overview of the Main Inference Algorithms of LDA Model Parameters ... 38
2.3.1 Variational Inference (VI) ... 38
2.3.2 Belief Propagation (BP) ... 42
2.3.3 Gibbs Sampling (GS) ... 44
2.3.4 Analysis and Discussion ... 45
2.4 An Overview of Genetic Algorithm ... 46
2.4.1 The Basic Idea of Genetic Algorithm ... 46
2.4.2 The Main Steps of Genetic Algorithm ... 47
2.4.2.1 Coding Mechanism ... 47
2.4.2.2 Fitness Function ... 48
2.4.2.3 Selection ... 48
2.4.2.4 Crossover ... 49
2.4.2.5 Mutation ... 49
2.4.3 The Parameter Settings of Genetic Algorithms ... 50
2.5 An Overview of Hadoop ... 51
2.5.1 HDFS ... 52
2.5.2 MapReduce Programming Model in Hadoop ... 53
2.5.3 Hadoop Scheduling Algorithms ... 54
2.5.4 Divisible Load Theory ... 56
2.6 Summary ... 57
Chapter 3 Text Classification with Latent Dirichlet Allocation ... 59
3.1 Introduction ... 59
3.2 Overview of Text Classification ... 59
3.2.1 The Content of Text Classification ... 62
3.2.2 Text Preprocessing ... 62
3.2.3 Text Representation ... 63
3.2.4 Text Feature Extraction and Dimension Reduction ... 65
3.2.5 Text Classification Algorithms ... 67
3.2.6 Classification Performance Evaluation System ... 71
3.3 Text Classification Based on LDA ... 72
3.3.1 Gibbs Sampling Approximate Inference of LDA Model Parameters ... 72
3.3.2 The Specific Steps of Text Classification ... 75
3.4 Experiment and Analysis ... 77
3.4.1 Experimental Environment ... 77
3.4.2 Training Environment for SVM ... 77
3.4.3 The Data Set ... 78
3.4.4 Evaluation Methods ... 80
3.4.5 Experimental Results ... 82
3.5 Summary ... 85
Chapter 4 Accuracy Enhancement with Optimized Topics and Noise Processing ... 86
4.1 Introduction ... 86
4.2 The Method of Selecting the Optimal Number of Topics ... 86
4.2.1 Current Main Selection Methods of the Optimal Number of Topics Based on the LDA Model ... 87
4.2.1.1 The Method of Selecting the Optimal Number of Topics Based on HDP ... 88
4.2.1.2 The Standard Method of Bayesian Statistics ... 90
4.2.2 A Density-Based Clustering Method of Selecting the Optimal Number of Topics in LDA ... 91
4.2.2.1 DBSCAN Algorithm ... 92
4.2.2.2 The Relationship between the Optimal Model and the Topic Similarity ... 93
4.2.2.3 A Method of Selecting the Optimal Number of Topics Based on DBSCAN ... 95
4.2.3 Experiment and Result Analysis ... 98
4.3 Noisy Data Reduction ... 100
4.3.1 The Noise Problem in Text Classification ... 100
4.3.2 The Current Main LDA-Based Methods of Noise Processing ... 102
4.3.2.1 The Data Smoothing Based on the Generation Process of Topic Model ... 102
4.3.2.2 The Category Entropy Based on LDA Model ... 106
4.3.3 A Noise Data Reduction Scheme ... 110
4.3.4 The Experiment and Result Analysis ... 113
4.4 Summary ... 119
Chapter 5 Genetic Algorithm Based Static Load Balancing for Parallel Latent Dirichlet Allocation ... 120
5.1 Introduction ... 120
5.2 The Current Main Parallelization Methods of LDA ... 121
5.2.1 Mahout's Parallelization of LDA ... 121
5.2.2 Yahoo's Parallel Topic Model ... 123
5.2.3 The Algorithm of Parallel LDA Based on Belief Propagation ... 124
5.2.4 Google's PLDA ... 126
5.2.5 Analysis and Discussion ... 127
5.3 Parallelizing LDA with MapReduce/Hadoop ... 129
5.3.1 The Working Process of MapReduce on Hadoop ... 129
5.3.2 The Algorithm and Implementation of PLDA ... 130
5.4 A Static Load Balancing Strategy Based on Genetic Algorithm for PLDA in Hadoop ... 132
5.4.1 The Algorithm Design ... 133
5.4.2 The Design and Implementation of Genetic Algorithm ... 136
5.4.2.1 Encoding Scheme ... 136
5.4.2.2 The Initialization of Population ... 137
5.4.2.3 Fitness Function ... 137
5.4.2.4 Crossover ... 138
5.4.2.5 Mutation ... 139
5.4.2.6 The Optimal Retention Strategy ... 140
5.5 Experiment and Analysis ... 140
5.5.1 Evaluating PLDA in Hadoop ... 140
5.5.1.1 The Experimental Environment ... 141
5.5.1.2 Experimental Data ... 143
5.5.1.3 The Experiment in the Homogeneous Environment ... 144
5.5.1.4 The Experiment in the Heterogeneous Environment ... 147
5.5.2 The Experiment of Load Balancing in a Simulated Environment ... 148
5.6 Summary ... 154
Chapter 6 Conclusion and Future Works ... 155
6.1 Conclusion ... 155
6.2 Future Works ... 161
6.2.1 The Supervised LDA ... 162
6.2.2 The Improved PLDA: PLDA+ ... 163
6.2.3 Dynamic Load Balancing Problem ... 164
6.2.4 The Application of Cloud Computing Platform ... 165
6.2.5 The Application of Interdisciplinary Field ... 165
References ... 168
List of Figures

Figure 2.1: The generative process of the topic model ... 17
Figure 2.2: (Left) The unigrams. (Right) The mixture of unigrams ... 19
Figure 2.3: The diagram of singular value decomposition (SVD) ... 22
Figure 2.4: The graphical model representation of PLSI ... 27
Figure 2.5: The network topology of LDA latent topics ... 31
Figure 2.6: (Left) The structure diagram of LDA latent topics. (Right) The graphical model representation of LDA ... 34
Figure 2.7: The LDA probabilistic graphical model with the variational parameters ... 39
Figure 2.8: Belief propagation in the LDA model based on the factor graph ... 43
Figure 2.9: The typical MapReduce framework in Hadoop ... 53
Figure 3.1: The typical process of automatic text classification ... 60
Figure 3.2: The separating hyperplane of the SVM algorithm ... 68
Figure 3.3: Comparison of the performance of three methods on each class ... 83
Figure 4.1: The HDP model ... 89
Figure 4.2: The relationship between log p(w|T) and T ... 98
Figure 4.3: The data smoothing based on LDA weakens the influence of noise samples ... 106
Figure 4.4: The flow diagram of a noise data reduction scheme ... 112
Figure 4.5: The relationship between the number of iterations and the effect of noise processing with different noise ratios ... 115
Figure 4.6: Classification results of different methods with various noise ratios in the first group of data ... 116
Figure 4.7: Classification results of different methods with various noise ratios in the second group of data ... 117
Figure 4.8: Classification results of different methods with various noise ratios in the third group of data ... 118
Figure 5.1: The framework of one Gibbs sampling iteration in MapReduce-LDA ... 132
Figure 5.2: The computing time with different numbers of data nodes in a homogeneous cluster ... 145
Figure 5.3: A comparison of computing time in dealing with different sizes of data with eight data nodes ... 147
Figure 5.4: The performance comparison of the load balancing strategy with different heterogeneity in a simulated environment ... 151
Figure 5.5: The performance comparison of PLDA with different sizes of data in a simulated environment ... 152
Figure 5.6: The convergence of the genetic algorithm in the load balancing strategy ... 153
List of Tables

Table 3.1: The distribution of five classes of text in the 20newsgroup corpus ... 79
Table 3.2: Comparison of three methods' macro-average and micro-average ... 84
Table 3.3: Comparison of three methods' dimensionality reduction degree to the corpus ... 84
Table 4.1: Results of the proposed algorithm to find the optimal value of topics ... 99
Table 4.2: Experimental data sets ... 113
Table 5.1: The features comparison of implementing PLDA with MPI and MapReduce ... 128
Table 5.2: The configuration of the experimental environment ... 141
Table 5.3: The configuration of nodes in a Hadoop cluster ... 142
Table 5.4: The experimental result of a single machine and one data node in the cluster processing the data ... 146
Table 5.5: The configuration of the simulated Hadoop environment ... 150
Chapter 1 Introduction

1.1 Background

So far, the Internet has accumulated a huge amount of digital information including news, blogs, web pages, e-books, images, audio, video, social networking and other forms of data, and this volume has been growing continually at an explosive speed [1]. Thus, how people can organize and manage large-scale data effectively and obtain the required useful information quickly has become a huge challenge. For example, the data is often too large for traditional data analysis tools and techniques to deal with. Sometimes, even if the data set is relatively small, traditional methods still cannot be used because of nontraditional characteristics of the data [2] [3]. In addition, expert system technology can put knowledge into a knowledge base manually through special users or domain experts. Unfortunately, this process often involves deviations and mistakes, and it is time-consuming and costly [4] [5]. Therefore, it is necessary to develop new technologies and automatic tools which can convert massive data into useful information and knowledge intelligently.

Data mining is a technique that combines traditional data analysis methods with complex algorithms capable of dealing with large amounts of data. Data mining is a complex process in which unknown and valuable modes or rules are extracted from mass data. Furthermore, it is an interdiscipline, closely related to database systems, statistics, machine learning, information science and other disciplines [1] [2] [4]. So, data mining can be seen as the result of the natural evolution of information technology. According to the processing object, it can be divided into object data mining, spatial data mining, multimedia data mining, Web mining and text mining [3] [6].
Text is the most important representation of information. Statistics research showed that 80 percent of the information in an organization is stored in the form of text, including books, research papers, news articles, Web pages, e-mail and so on [7]. Text can express vast and abundant information, and meanwhile it contains much undetected potential knowledge. A whole text set is not structured data and lacks machine-understandable semantics, so it is quite difficult to deal with a huge number of documents. Therefore, in the field of data mining, a new technology which can process such text data effectively was proposed, called text mining [5] [6] [8].

1.1.1 Text Mining Techniques

Text mining was first proposed by Ronen Feldman et al. in 1995, described as "the process of extracting interesting patterns from very large text collections for the purpose of discovering knowledge" [6]. Text mining is also known as text data mining or knowledge discovery in texts, a process in which unknown, potential, understandable and useful knowledge is found from mass text data [2] [6] [9] [10]. Meanwhile, it is also a process of analyzing text data, extracting text information and finding out text knowledge [9] [11] [12]. The main techniques of text mining include text classification, text clustering, text summarization, correlation analysis, information extraction, distribution analysis, trend prediction and so on [13] [14] [15]. Here, text classification and text clustering are mining methods whose object is the text set, whereas the processing object of text summarization and information extraction is a single document. There are many classification methods in text classification; the frequently used methods include Naive Bayes (NB), K-Nearest Neighbor (K-NN), Support Vector Machines (SVM), Vector Space Model (VSM) and Linear Least Squares Fit (LLSF) [6][7][8][9][12][13][14][15].
In the field of text mining, machine learning experts researched and put forward probabilistic topic models, in which fast unsupervised machine learning algorithms are used to find out hidden text information automatically [16] [17] [18] [19]. At present, the main models for mining latent semantic knowledge are Latent Semantic Indexing (LSI) [20] [21] [22] [23] [24], Probabilistic Latent Semantic Indexing (PLSI) [25] [26] [27] [28] [29] and Latent Dirichlet Allocation (LDA) [30] [31] [32]. Their application covers almost all areas of text mining and information processing, such as text summarization, information retrieval and text classification [33] [34] [35]. In particular, Blei et al. put forward the LDA model in 2003, which has been widely used to solve text classification, text annotation and other issues. Besides, it created a series of text processing methods based on probabilistic topic modeling [17] [36], and it was expanded to images, audio, video and other multimedia data processing fields [37] [38].

LDA is a probabilistic topic model which models discrete data sets such as text sets [30]. It treats documents as probability distributions over topics and simplifies the generative process of the text, which helps to handle large-scale text sets efficiently. In addition, LDA is a three-tier Bayesian model, which comprises a three-tier structure of words, topics and documents. It expresses each document as a topic mixture, where each topic is a probability distribution over a fixed word list. In brief, the basic modeling process of LDA is to first establish a document-word co-occurrence matrix. Then, a text training set is modeled. Next, inference methods are used to obtain the model parameters, such as the document-topic matrix and the topic-word matrix. Finally, the learned model is used to predict the topic probability distribution of new documents so as to express text information [17] [30] [32].

Compared with other topic models, LDA has some unique advantages. Firstly, the LDA
topic model is a completely probabilistic generative model, which can use mature probability algorithms to train models directly, and the model is easy to use [39]. Secondly, the size of the parameter space of LDA is fixed and has nothing to do with the size of a text set, so it is more suitable for large-scale text sets [40]. Thirdly, LDA is a hierarchical model, which is more stable than non-hierarchical models [39].

1.1.2 High Performance Computing for Text Mining

In the aspects of processing speed, storage space, fault tolerance and access speed, the traditional technical architecture and a single computer with a serial-based approach are more and more unsuitable for handling mass data [41] [42] [43] [44]. Parallel computing is an effective method of improving the computation speed and processing capacity of a computer system. Its basic idea is to decompose a problem into several parts, each of which is computed by an independent processor in parallel [45] [46] [47]. A parallel computing system can be a supercomputer with multiple processors, and it can also be a cluster made up of several independent computers that are interconnected in some way [44] [48]. However, for parallel programming models in traditional high-performance computing, such as PThread, MPI and OpenMP [49] [50], developers have to understand low-level configurations and parallel implementation details.

In 2004, Jeffrey Dean and Sanjay Ghemawat, two engineers at Google, proposed the programming idea of MapReduce [51] and applied it to the parallel distributed computing of large-scale data sets. Following functional programming ideas, the MapReduce framework divides a computational process into two phases: Map and Reduce [52] [53] [54]. In the Map phase, the input data is transmitted to map functions in the form of key-value pairs. After these operations, an intermediate set of key-value pairs is generated. In the Reduce phase, all the intermediate value sets with the same keys are merged. The final result is output in the form of key-value
pairs. Users only need to implement the two functions map and reduce, which reduces the complexity of designing parallel programs significantly [51] [52].

Then, Google CEO Eric Schmidt first proposed the concept of cloud computing in 2006 [43]. The big international Internet information technology companies launched a series of products in succession, showing their research results [55] [56]. For example, Google launched Google App Engine, a development platform based on the cloud computing environment. Amazon launched Amazon EC2 (Amazon Elastic Compute Cloud), a powerful cloud computing platform, and Amazon S3 (Amazon Simple Storage Service), which provides a cloud storage service. Microsoft launched Windows Azure, a cloud computing platform. Cloud computing is the result of the comprehensive development of parallel computing, distributed computing and grid computing. In other words, cloud computing is the commercial implementation of the above computing science concepts [43] [57]. Cloud computing is a new computing mode and a new composite mode of computer resources, and it represents an innovative business model [55]. The basic principle of cloud computing is to distribute the required calculation, transmission and storage from local computers or remote servers to a large number of distributed computers, which means that the computers share huge tasks of calculation, transmission and storage together. Users can gain the corresponding services through the network, improving the resource utilization rate effectively and realizing resource acquisition on demand [58] [59].

The above cloud computing platforms are commercial products, which are not freely available to most researchers. Thus, the emergence of open source cloud computing platforms gives researchers opportunities: they can use these open source projects to build a cluster system made up of several machines in a laboratory environment, simulating the cloud computing environment of business systems [57] [58]. Hadoop is one of the most
famous open source distributed computing frameworks, and it implements the main technologies of Google cloud computing [60] [61] [62]. Basically, its core content consists of the Hadoop Distributed File System (HDFS) [53] [60] [61] [63] and the MapReduce programming model [64] [65] [66]. Hadoop originated in the Lucene and Nutch projects, and then it developed into a separate project of the Apache foundation. It is mainly used for processing mass data [53] [60] [63]. Besides, it provides a Java-based MapReduce framework with strong portability, which allows distributed computing to be applied to large low-cost clusters. Meanwhile, for individuals and companies, its high performance lays the foundation for the research and application of their distributed computing [67] [68]. Yahoo was the first company to use, research and develop Hadoop deeply. At Yahoo, there were more than ten thousand CPUs in a single cluster using Hadoop, and there were hundreds of researchers using Hadoop [52] [54]. Nowadays, it has become the mainstream technology of cloud computing, researched and used by universities, research institutions and Internet companies [67]. Along with the increasing popularity of the cloud computing concept, Hadoop will get more attention and faster development [65].

According to the digitized information report published by International Data Corporation (IDC) in 2011, the amount of global information would double and redouble every two years [1]. So, there is no doubt that this puts great pressure on the data storage and computing power of servers, so that traditional topic model algorithms cannot process large-scale text sets even on high-performance servers [44] [69]. In addition, people cannot rely only on the improvement of computer hardware technology to enhance processing efficiency; they need efficient parallel computing technologies applied to multi-core computers and cluster systems with high performance.
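The two MapReduce phases described above can be sketched with a toy word count in plain Python (a conceptual illustration only, not Hadoop's actual API; `map_fn`, `shuffle` and `reduce_fn` are illustrative names):

```python
from collections import defaultdict

def map_fn(document):
    # Map phase: emit intermediate (key, value) pairs, here (word, 1)
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Group all intermediate values that share the same key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    # Reduce phase: merge the value list of one key into a final pair
    return (key, sum(values))

documents = ["latent dirichlet allocation", "parallel latent dirichlet"]
intermediate = [pair for doc in documents for pair in map_fn(doc)]
result = dict(reduce_fn(k, v) for k, v in shuffle(intermediate).items())
print(result)  # {'latent': 2, 'dirichlet': 2, 'allocation': 1, 'parallel': 1}
```

In a real Hadoop job the shuffle step is performed by the framework between the two user-supplied functions, which is precisely why only map and reduce need to be implemented.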
Through a combination of hardware and software, accelerated processing can be accomplished for complex calculations, solving the bottleneck of computation time and memory capacity. At the
same time, parallel computing can save cost to a certain extent and complete large-scale computing tasks with a lower investment [51] [68] [70].

When dealing with mass data, the disadvantage of LDA is that the computation complexity is high and the processing time is long [17] [71] [72]. Thus, how LDA can model large-scale text effectively, meet the requirements of computation time and memory capacity, and finally find out latent topic rules is another big issue. In order to improve the LDA model in processing mass data, parallel LDA algorithms have become one of the research hotspots. Researching the MapReduce parallel algorithm and applying it to topic models has much practical significance [51] [52] [66]. Firstly, parallel algorithms can solve the problems of calculation and storage, and they can process large-scale data more efficiently. Secondly, parallel topic models are able to address many practical application problems, such as text classification, information retrieval, text summarization and so on [17] [73]. Finally, parallel topic models can analyze massive Internet user behavior data to obtain useful information, which can be used in social networking websites, intelligent search and other Internet products [18] [72].

1.2 Motivation of Work

Automatic text classification involves data mining, computational linguistics, informatics, artificial intelligence and other disciplines, and it is the basis and core of text mining. It is an important application field of natural language processing, and also a significant application technology for processing large-scale information [2] [4] [74]. In short, text classification technology divides a large number of documents into one or a group of categories, where each category represents a different concept topic. Automatic text classification systems can help users organize and obtain information well, playing a significant role in improving the speed and
precision of information retrieval [5] [8]. Therefore, it has a very important research value.

In text classification, the selection and implementation of classification methods are the core parts of classification systems, and how to choose an appropriate classification model is an important issue [75] [76]. In addition, text feature selection is another key technology in text classification [77] [78] [79]. The main role of applying probabilistic topic models to text classification is to achieve dimensionality reduction of the text data representation space by finding out latent topic information in documents. In the early stage, the most typical representative was the LSI model. The dimensionality reduction effect on its feature space is significant, but its classification performance often decreases [80] [81]. Furthermore, its parameter space grows linearly with the training data because of the high computation complexity of the SVD operation in LSI. These are limitations of LSI when it is applied to practical problems. PLSI can be seen as a probabilistic improved version of LSI, which can describe the probabilistic structure of the latent semantic space [25] [26]. However, PLSI still has the problem of the parameter space growing linearly with the training set [29] [40]. Compared with LSI and PLSI, the LDA model is a completely probabilistic generative model, so it has a complete internal structure and is able to use mature probabilistic algorithms to train and make use of the model. In addition, the size of the parameter space of the LDA model has nothing to do with the number of documents. Thus, LDA is more suitable for constructing a text representation model on a large-scale corpus [30] [32]. And the LDA model has achieved successful applications in machine learning, information retrieval and many other fields. In the research of text classification, the LDA model is effective but its performance is not remarkable. The reason is that LDA is an unsupervised learning model, so it cannot be directly used in classification; it needs to be embedded into appropriate classification algorithms [82].
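To illustrate what embedding LDA into a classification algorithm means in practice: after inference, each document is reduced to a low-dimensional topic-proportion vector (a row of the document-topic matrix), and any vector-space classifier can then be trained on these vectors. The sketch below uses hypothetical topic proportions and a simple nearest-centroid rule as a stand-in for SVM:

```python
def centroid(vectors):
    # Mean of a list of topic-proportion vectors (one class centroid)
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def classify(doc_topics, centroids):
    # Assign the label whose class centroid is closest in squared distance
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist(doc_topics, centroids[label]))

# Hypothetical document-topic proportions inferred by LDA (3 topics)
train = {
    "sport":    [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1]],
    "politics": [[0.1, 0.8, 0.1], [0.2, 0.7, 0.1]],
}
centroids = {label: centroid(vecs) for label, vecs in train.items()}
print(classify([0.75, 0.15, 0.10], centroids))  # sport
```

The point of the sketch is the representation, not the classifier: the feature space has as many dimensions as topics rather than as many as vocabulary words, which is why an SVM trained on such vectors (as in Chapter 3) is much cheaper than one trained on a VSM representation.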
The LDA model itself is a generative model, and it is usually integrated with generative classification algorithms. In this generative integrated mode, the training of classifiers does not use the entire training corpus to gain a single LDA model; instead, a sub-LDA model is obtained from each category of the training corpus. In fact, each sub-LDA model describes the latent topics of the corresponding category of documents [32] [83]. Because of separate training, each category shares a group of topics, and topics between categories are isolated. However, one of the main problems of this classification method is that the target document will be given a forced distribution over latent topics in categories which do not contain the target document, so that the calculation of the generative probability produces deviation in these categories and reduces the classification performance [84]. Therefore, how to choose a suitable classification algorithm combined with the LDA model to construct an effective classifier has become a challenge.

In the LDA model, topics obey the Dirichlet distribution, which assumes that the emergence of a topic has nothing to do with the emergence of other topics. But in real data, many topics are associated with each other. For example, when "PC" appears, the probability of "Computer" appearing will be quite high, but it is unlikely for "Hospital" to appear. Obviously, this independence assumption is inconsistent with real data. So, when using the LDA model to model a whole text set, the number of topics in the LDA model greatly influences the performance of modeling [82]. Therefore, how to determine the optimal number of topics in the LDA model is another research hotspot of LDA. At present, there are two main methods to determine the optimal number of topics in the LDA model: the selection method based on the Hierarchical Dirichlet Process (HDP) [86] [87] and the standard method in Bayesian statistics [82]. The former uses the nonparametric feature of the Dirichlet Process (DP) to solve the selection problem of the optimal number of topics in LDA.
However, it needs to establish an HDP model and an LDA model respectively for the same data set. Obviously, when processing practical
text classification problems, it will spend a lot of time in computing [86] [88]. The latter needs to specify a series of values of T manually, where T stands for the number of topics in LDA. After carrying out the Gibbs sampling algorithm and the related calculation, the optimal value of T for the LDA model can finally be determined [82]. Thus, an algorithm needs to be designed to find out the optimal number of topics automatically with low time consumption and operation cost.

In addition, in text classification, the quality of classifiers greatly affects the final classification result, and the quality of classifiers depends to a great extent on the quality of the training text set. In general, if the classes of the training text set are more accurate and its content is more comprehensive, the quality of the obtained classifier will be higher [89] [90]. However, in practical applications, it is quite difficult to obtain such high-quality training text sets, especially large-scale text sets. Usually, the training data unavoidably contains noise, and the noise samples have a significant impact on the final classification result [91]. A mainstream approach to noise processing is to identify and remove noise samples from data sets [90] [91]. At present, there are two main noise processing methods based on LDA: the data smoothing method based on the LDA model [92] and the noise identification method based on the category entropy [93] [94]. They can remove a majority of noise samples from text sets effectively to some extent, but they cannot delete all the noise completely [95]. Furthermore, some normal samples will unavoidably be wrongly removed as noise [96]. Therefore, these issues also become a challenge in text classification.

In order to meet the requirements of storage capacity and computation speed, single-core serial programming is gradually turning to multi-core parallel programming technology [41] [42]. Parallel programming applied to clusters or servers can overcome some limitations when topic model learning algorithms process mass data [69]. So, parallel topic models are able to deal with large-scale data effectively.
But each node may have a different computation speed, communication capability and storage capacity in a heterogeneous cluster, so the load of tasks becomes unbalanced, which reduces the computational efficiency of the cluster [70]. Therefore, the load of tasks in a cluster should be kept balanced as far as possible, decreasing the synchronization waiting time of each node. The task scheduler mainly solves the problem of load balancing. But in a heterogeneous computing environment, the current main parallel computing platforms lack effective scheduling algorithms [70] [97]. For instance, there are three main schedulers in Hadoop: First in First out (FIFO), the fair scheduler and the capacity scheduler. They are applicable to a homogeneous environment, but in a heterogeneous environment they waste a lot of resources and incur longer overheads, affecting the overall performance of the Hadoop platform. An idea of this thesis is to select and design an appropriate scheduling strategy and the corresponding operation parameters, so that compute nodes are kept in working condition as far as possible, minimizing the overhead of scheduling and synchronization, realizing load balancing and enhancing the performance of parallel programs.

1.3 Major Contributions

In simple terms, the major contributions of this thesis are to improve the performance of the typical ordinary LDA algorithm in two aspects: increasing accuracy and speeding up computation. For the problems of text classification, the optimal number of topics in the model, noise processing and dealing with large-scale data sets, appropriate and effective strategies are designed respectively based on the LDA algorithm. The following are the details of the major contributions of this thesis:

First of all, probabilistic topic models are reviewed, including the basic concepts and modeling processes of LSI, PLSI and LDA, and their advantages and
disadvantages are analyzed and compared. To address the problem that latent topics are assigned by force, which reduces classification performance when the traditional LDA model is used in text classification, a method of text classification based on the LDA model is proposed. Its basic idea is to use the LDA algorithm to model the text set first and then obtain the probability distribution of documents over the topic set, which removes features that are meaningless to classification and compresses the high-dimensional sparse feature space. In the classifier design, SVM outperforms other text classification algorithms in processing high-dimensional data and in classification accuracy, so the SVM algorithm is used to construct a text classifier. This avoids the problems of traditional text representation methods, such as the high-dimensional sparse feature space generated by VSM. At the same time, it also overcomes the damage to classification performance caused by LSI. In addition, using LDA to model text sets can reduce the dimensionality of the feature space effectively, shorten the training time of SVM greatly and improve the efficiency of classification. When the LDA model is used to model the whole text set, the number of topics influences the modeling significantly. Thus, following the idea of calculating sample density to measure the correlation between topics in density-based clustering, a selection method for the optimal number of topics based on Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is proposed. Firstly, the relationship between the optimal topic model and the topic similarity is proved in theory: when the average similarity of the topic structure is the smallest, the corresponding model is optimal. Then, according to the constraint condition, the selection of the optimal value and the parameter estimation of the LDA model are combined in one framework. By calculating the density of each topic, the most unstable topic structure is found. This is iterated again and again until the model is stable, seeking out the optimal number of topics automatically.
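As an illustrative sketch only (not the thesis's implementation), the average pairwise topic similarity that the selection method minimizes can be computed from a topics-by-words distribution matrix. The function name and the toy data below are hypothetical:

```python
import numpy as np

def average_topic_similarity(phi):
    """Mean pairwise cosine similarity between the rows of a
    topics-by-words distribution matrix phi (each row sums to 1)."""
    unit = phi / np.linalg.norm(phi, axis=1, keepdims=True)
    sims = unit @ unit.T                      # cosine similarity matrix
    k = phi.shape[0]
    # average over the k*(k-1) off-diagonal entries
    return (sims.sum() - np.trace(sims)) / (k * (k - 1))

# Two sharply separated topics vs. two near-identical ones:
distinct = np.array([[0.9, 0.1, 0.0, 0.0],
                     [0.0, 0.0, 0.1, 0.9]])
similar  = np.array([[0.5, 0.5, 0.0, 0.0],
                     [0.4, 0.6, 0.0, 0.0]])
assert average_topic_similarity(distinct) < average_topic_similarity(similar)
```

Under the criterion stated above, a topic structure like `distinct` (lower average similarity) would be preferred over `similar` when comparing candidate values of T.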
Combining the data smoothing method based on the LDA model and the noise identification method based on the category entropy, a noise data reduction scheme is put forward. Its basic idea is to first bring in the concept of information entropy. After noise processing based on the category entropy, most of the noise samples have been removed, improving the quality of the training text set. Then, the generative process of the LDA topic model is used to smooth the data so that the quality of the training text set is further improved. At this point, one iteration of the algorithm is complete. But the training text set is usually still affected by noise samples after a single iteration. Thus, the whole process needs K iterations, where the smoothed training text set is used as the initial training data of the next iteration. After the iterations finish, the obtained training text set is the final one, which is applied to classification. The proposed strategy weakens the influence of noise samples missed in the noise filtering phase. Meanwhile, it maintains the size of the original training set. By comparing and analyzing the current main methods of parallel LDA, the PLDA algorithm and its implementation process on the Hadoop platform are studied in detail. For the problem of load balancing in a heterogeneous environment, a static load balancing strategy based on a genetic algorithm for PLDA is proposed. According to the optimality principle of the divisible load theory, each data node should theoretically complete local Gibbs sampling at the same time. In addition, the concept of the mean square error is used to convert this into the fitness function of the genetic algorithm. In fact, partitioning the data before tasks start can minimize the synchronization waiting time between data nodes and improve the performance of the Hadoop cluster in a heterogeneous environment.
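The mean-square-error fitness idea above can be sketched as follows; this is a hedged illustration, not the thesis's code, and the function name, node speeds and partition sizes are invented for the example. Each node's completion time is its share of the data divided by its relative speed, and a partition is fitter when those times agree:

```python
import numpy as np

def load_balance_fitness(partition, speeds):
    """Fitness of a data partition across heterogeneous nodes: the
    closer the per-node completion times are to each other, the
    smaller their mean square error and the higher the fitness."""
    times = np.asarray(partition, dtype=float) / np.asarray(speeds, dtype=float)
    mse = np.mean((times - times.mean()) ** 2)
    return 1.0 / (1.0 + mse)   # larger fitness == better balanced

speeds = [4.0, 2.0, 1.0]            # relative node processing speeds
balanced   = [4000, 2000, 1000]     # documents proportional to speed
unbalanced = [1000, 2000, 4000]
assert load_balance_fitness(balanced, speeds) > load_balance_fitness(unbalanced, speeds)
```

A genetic algorithm would evolve candidate partitions, using a fitness of this shape to reward partitions whose nodes finish local Gibbs sampling at (nearly) the same time.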
1.4 Structure of the Thesis

Chapter 1 introduces the background and motivation of the research works of this thesis. Besides, it briefly describes the major contributions of this thesis and the arrangement of the chapters, making the significance, content and plan of the thesis clear. Chapter 2 first summarizes probabilistic topic models, introduces the basic concepts and the modeling processes of LSI, PLSI and LDA in detail, and analyzes and compares their advantages and disadvantages. It then summarizes three main inference algorithms for LDA model parameters: variational inference, belief propagation and Gibbs sampling. Next, it introduces the basic concepts, main steps and parameter settings of genetic algorithms. Finally, the related concepts of Hadoop are reviewed, covering HDFS, the MapReduce programming model, the scheduling algorithms of Hadoop and the divisible load theory. Chapter 3 summarizes the key technologies of text classification, including text preprocessing, text representation models, text feature extraction and dimension reduction, text classification models, and classification performance evaluation. After introducing the Gibbs sampling algorithm, a text classification method based on the LDA model is proposed. The LDA topic model and the SVM classification algorithm are then combined to build a text classification system. At last, the validity and advantages of the proposed classification method are proved by contrast experiments. Chapter 4 introduces two current main methods which can determine the optimal number of topics in the LDA model: the standard method of Bayesian statistics and the method of selecting the optimal number of topics based on HDP. Then, building on the density-based clustering algorithm, the algorithm of selecting the optimal number of topics based on DBSCAN is put forward, and experiments prove its feasibility and validity. In addition, it also introduces two noise processing methods based on the LDA
model: the data smoothing based on the LDA model and the noise identification based on the category entropy. Next, the two methods are combined, and a noise data reduction scheme is proposed. In the end, experiments verify its validity and stable performance. Chapter 5 summarizes and compares the current main methods of parallel LDA, including Mahout's parallelization of LDA, Yahoo's parallel topic model, the algorithm of parallel LDA based on Belief Propagation (BP) and Google's PLDA. After studying the PLDA algorithm and its implementation process in the Hadoop framework, a static load balancing strategy for PLDA based on a genetic algorithm is proposed, and its design and implementation are described in detail. Finally, in real homogeneous and heterogeneous Hadoop cluster environments, experiments show that the PLDA algorithm is an effective method to deal with massive data. Moreover, in a Hadoop simulation environment, an experiment proves that the proposed strategy is effective and stable, and that it can enhance the computational efficiency of the Hadoop cluster. Chapter 6 summarizes the major research works and contributions of this thesis. In addition, it points out the limitations of the thesis and the direction of future research works.
Chapter 2 Literature Review

2.1 Introduction

This chapter first introduces probability topic models, and compares and analyzes the advantages and disadvantages of LSI, PLSI and LDA. Then, the main inference algorithms for LDA model parameters are described: variational inference, the belief propagation algorithm and Gibbs sampling. Finally, an overview of genetic algorithms, the Hadoop framework (which includes HDFS, MapReduce and the main scheduling algorithms) and the divisible load theory is presented.

2.2 Probability Topic Models

Recently, probability topic models have had a very wide range of applications. For example, they have achieved quite good results in the fields of text classification and information retrieval [18]. Given a document collection, probability topic models find a set of low-dimensional polynomial distributions by parameter estimation. Here, each polynomial distribution is called a topic, which is used to capture the related information among words. They can extract a comprehensible and stable latent semantic structure without the computer really understanding natural language, finding a relatively short description for the documents in a large-scale data set [98]. The main idea of probability topic models is to assume that a document is a mixed distribution of some topics, and each topic is a probability distribution over words. Thus, probability topic models can be seen as text generation models. The generation of documents is a simple probability process based on topic models [18] [19]. For example, when a new text is to be generated, a topic distribution is obtained first.
Next, for each word in the text, a topic is obtained randomly from the topic distribution. In the end, a word is drawn randomly from the word distribution of that topic.

Figure 2.1: The generative process of the topic model. (Topic 1: money 0.446, bank 0.277, loan 0.277; Topic 2: river 0.388, bank 0.306, stream 0.306. Documents 1-3 are composed of words drawn from these topics, with a superscript marking each word's topic.)

Figure 2.1 shows a simple text generation process. Topic 1 and topic 2 are associated with money and river, respectively. They have different word distributions, so they can constitute documents by choosing words which have different degrees of importance to each topic. Document 1 and document 3 are generated by random sampling from topic 1 and topic 2, respectively. But topic 1 and topic 2 together generate document 2 according to a mixture of their topic distributions. Here, the number at the right side of a word is the number of the topic it belongs to, and the word is obtained by random sampling from the numbered topic. In this section, several traditional topic models are introduced briefly. The main inference algorithms of LDA model parameters will then be described in detail in the next section.
2.2.1 TF-IDF Model

The Term Frequency-Inverse Document Frequency (TF-IDF) model was proposed in 1983 [99]. Its approach is to establish a |V| × |D| matrix, where V represents a vocabulary that contains all the possible words and |V| is the size of the vocabulary; D stands for a text set, and |D| is the size of the text set. For each word, the Term Frequency (TF) is calculated in each document, and the inverse of the number of documents which contain the word is also computed. Finally, the product of TF and Inverse Document Frequency (IDF) is stored in the corresponding position of the matrix. According to the core idea of the TF-IDF model, the more frequently a word appears in one document (measured by TF), or the less frequently the word appears across all the documents (measured by IDF), the more important the word is to the document. Thus, the importance of the whole document can be obtained by calculating the importance of all its words [103] [104]. Through some simple normalization processing, the probability distribution of these words in the whole text set can be gained [100]. However, TF-IDF is not a real topic model. Its main contribution is to convert documents of variable length into a matrix of fixed size. It does not identify semantics, and it cannot recognize and deal with synonyms and polysemes [33], because all the words are just simple textual characters in the TF-IDF model. Another problem of TF-IDF is that the required memory space and computational time are very large. In order to contain more word information, it is necessary to expand the size of the vocabulary. In addition, for more model training, the number of training documents has to increase [101]. Therefore, the storage space needed to save the TF-IDF matrix will increase several times over.
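The TF-IDF weighting described above can be sketched in a few lines. This is a minimal illustration of the idea, not the thesis's implementation; the function name and toy corpus are invented, and a common idf = log(|D|/df) variant is assumed:

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Build a per-document TF-IDF weight table.
    tf = raw count of the word in the document; idf = log(|D| / df)."""
    vocab = sorted({w for d in docs for w in d})
    df = {w: sum(1 for d in docs if w in d) for w in vocab}  # document frequency
    n_docs = len(docs)
    matrix = []
    for d in docs:
        tf = Counter(d)
        matrix.append({w: tf[w] * math.log(n_docs / df[w]) for w in vocab})
    return vocab, matrix

docs = [["bank", "money", "loan"],
        ["river", "bank", "stream"],
        ["money", "money", "loan"]]
vocab, m = tfidf_matrix(docs)
# "river" appears in only one document, so it outweighs the common word "bank":
assert m[1]["river"] > m[1]["bank"]
```

Note how a word occurring in every document gets idf = log(1) = 0, matching the intuition that ubiquitous words carry little discriminative weight.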
2.2.2 Mixture of Unigrams

Figure 2.2 (Left) shows that the boxes are plates that represent replicates. The outer plate represents documents, and the inner plate represents the repeated choice of words within a document. In addition, the filled circle represents the observed variable w. In the unigram model, the words in each document are generated independently from a polynomial distribution, which can be expressed as Equation (2.1):

p(w) = Π_{n=1}^{N} p(w_n)    (2.1)

where N is the number of words and w_n represents one of the words.

Figure 2.2: (Left) The unigram model. (Right) The mixture of unigrams.

If a discrete random topic variable z is added to the unigram model, it becomes the mixture of unigrams, which is shown in Figure 2.2 (Right). The hollow circle represents the latent variable z, and the black arrow represents the conditional dependency between two variables, namely the conditional probability. In the mixture of unigrams, for each document, a topic z is chosen first. Then, according to the conditional polynomial distribution p(w|z), the N words of the document are generated independently [100]. So, the probability of each document can be obtained as
Equation (2.2):

P(w) = Σ_z P(z) Π_{n=1}^{N} P(w_n|z)    (2.2)

where N is the number of words, w_n represents one of the words, w represents the word variable and z represents the topic variable. When the whole text set is estimated, the mixture of unigrams assumes that each document expresses only one single topic. The distribution of the words can then be seen as the topic representation under the assumption that each document corresponds to one topic [102], which can be expressed as Equation (2.3):

P(w|d) = Σ_z P(w|z) P(z|d)    (2.3)

where d represents the document variable, w represents the word variable and z represents the topic variable. Because of the limitation of the above assumption, the mixture of unigrams is often ineffective when dealing with large-scale text sets.

2.2.3 LSI Model

2.2.3.1 Basic Concepts

The idea of the probability topic model originated from Latent Semantic Analysis (LSA). LSA is also known as Latent Semantic Indexing (LSI), which was proposed by Scott Deerwester et al. as a semantic space model in 1990 [20]. It is an extension of the Vector Space Model (VSM), addressing the problems of TF-IDF. LSI assumes that the words in the documents have some latent semantic structure. For example, synonyms share basically the same semantic structure, while polysemes must have a variety of different semantic structures. Its principle is to use the Singular Value Decomposition (SVD) technique to decompose the TF-IDF matrix into singular matrices. By removing the small singular value vectors and only retaining the top K maximum values, the original document vector
and the query vector are mapped from the word space to a K-dimensional semantic space. In the low-dimensional space, the latent semantic relations of the TF-IDF matrix are kept, eliminating the influence of synonyms and polysemes and improving the accuracy of text representation [22] [23]. The LSI model has been widely used in text classification because of its significant dimension reduction effect [20]. In fact, through the SVD of the word-document matrix, the LSI model maps the original matrix to a K-dimensional latent semantic space, where K stands for the number of largest singular values chosen. After mapping, the singular value vectors can best reflect the dependence relationship between words and documents. The dimensionality of the latent semantic space is much smaller than that of the original space, achieving the purpose of dimension reduction [21]. In the low-dimensional space, documents which originally share few or no common words may still have a large similarity because of the co-occurrence relations between the words.

2.2.3.2 The Modeling Process

At present, the classic method of singular value decomposition of a matrix is used to achieve LSI. One of its advantages is that its decomposition effect is good and its scalability is strong [20] [105]. First of all, according to a training corpus, a document-word matrix C is constructed. In the matrix, each element stands for the frequency of the corresponding word appearing in a document. According to the SVD algorithm, the matrix C can be broken up into three matrices, which is shown in Figure 2.3 and can be expressed as follows:

C = UΣV^T    (2.4)

where C is an n × m matrix, U stands for an n × n orthogonal matrix, V represents an m × m orthogonal matrix, and Σ is defined as follows:

Σ = diag(σ_1, σ_2, …, σ_r)    (2.5)
where r = min{n, m} and σ_1 ≥ σ_2 ≥ … ≥ σ_r ≥ 0. Equation (2.4) is referred to as the singular value decomposition of C. σ_1, σ_2, …, σ_r are the singular values of C. The column vectors of U are the left singular vectors of C, and the row vectors of V are the right singular vectors of C.

Figure 2.3: The diagram of singular value decomposition (SVD).

As shown in Figure 2.3, the matrix Σ is a diagonal matrix. The diagonal elements, which are the singular values of the original matrix C, are nonzero and arranged in descending order. The singular values represent the characteristics of the matrix, and the larger singular values stand for its main characteristics. Therefore, the smaller singular values can be ignored so as to save a lot of space. When the matrix Σ only keeps part of the values on the diagonal, namely k rows and k columns, the right part of U and the bottom part of V shrink accordingly. At this point, three new matrices are generated: U_k (n × k), Σ_k (k × k) and V_k (k × m). The three matrices are then multiplied to obtain a new similar matrix C_k:

C_k = U_k Σ_k V_k    (2.6)
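The truncation in Equation (2.6) can be sketched with NumPy; this is an illustrative example with a made-up 5 x 4 count matrix, not part of the thesis:

```python
import numpy as np

# a small word-document co-occurrence matrix C (5 words x 4 documents)
C = np.array([[2., 0., 1., 0.],
              [1., 0., 2., 0.],
              [0., 3., 0., 1.],
              [0., 1., 0., 2.],
              [1., 1., 1., 1.]])

U, s, Vt = np.linalg.svd(C, full_matrices=False)

k = 2                                    # keep the k largest singular values
Uk, Sk, Vtk = U[:, :k], np.diag(s[:k]), Vt[:k, :]
Ck = Uk @ Sk @ Vtk                       # rank-k approximation, Equation (2.6)

assert Ck.shape == C.shape
# Ck is the best rank-k approximation: the spectral-norm error equals
# the largest discarded singular value (Eckart-Young theorem)
assert np.linalg.norm(C - Ck, 2) <= s[k] + 1e-9
```

The k columns of U_k Σ_k and rows of Σ_k V_k^T are then the word and document coordinates, respectively, in the K-dimensional latent semantic space described above.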
The new matrix C_k has the following two characteristics: C_k ignores the smaller singular values, so some noise is removed from the original matrix C to some extent. The elements in the matrix C stand for the frequency of words appearing in documents, so it must be a high-order, sparse matrix that is difficult to use directly. But due to the deletion of the smaller singular values, almost every position in the new matrix C_k has a value. Based on the new features of C_k, if it is multiplied by its own transposition, the similarity matrix between the words is obtained. The similarity matrix is a mathematical representation of the semantic space, which can be applied to keyword expansion and text classification [22]. For example, the cosine value between two row vectors of C_k, or two column vectors of U_k Σ_k, can reflect the similarity of the frequencies with which two words appear in the text set. If the value is 1, the two words have the same occurrence pattern; on the contrary, if the value is 0, they have completely different occurrence patterns. If two documents are compared, the cosine value between two column vectors of C_k, or two row vectors of V_k Σ_k, reflects the similarity between the two documents [22]. When a new document d is compared with the documents in the training set, a vector d' can be obtained by d' = d^T U_k Σ_k^{-1}. Then, the cosine value between the vector d' and V_k Σ_k reflects the content similarity between the new document and the documents in the training set. In text classification, the training text set usually needs to be expanded. If every expansion requires carrying out the singular value decomposition for the whole training text set again, it costs too much time. Especially when dealing with large-scale data sets, it may not work at all because of limited storage space. So, a simple
method, which is called folding-in, can solve this problem [24]. For example, if a new text vector d is added, d' will be added to the columns of V_k. If a new word vector w is added, w will be mapped to the document vector space, namely the columns of V_k, getting w' = w^T V_k Σ_k^{-1}. Then, the word vector w' will be added to the columns of U_k. Therefore, the folding-in method needs less time and storage space than recalculating the whole training text set.

2.2.3.3 The Advantages of LSI

LSI can be seen as an improvement of VSM which brings in the concept of latent semantics [23]. It can realize semantic retrieval to a certain extent, and eliminate the influence caused by synonyms and polysemes. SVD in LSI is used to achieve the purpose of information filtering and noise removal. Meanwhile, matrix rank reduction maps the high-dimensional representation of the document in VSM into the low-dimensional representation of the latent semantic space, reducing the scale of the problem greatly [22] [23]. SVD in LSI only applies mathematical treatment to the matrix, which needs no grammar, semantics or other basic knowledge of natural language processing [20]. Basically, it is a simple, mechanical method. In the original matrix, a word only appears in a few documents, so many element values of the matrix are zero. This data sparseness problem makes matrix manipulation quite difficult. But after SVD, the space dimension is reduced greatly, so the data sparseness problem is somewhat improved [21].

2.2.3.4 The Disadvantages of LSI

The physical meaning of SVD is not clearly defined. It is hard to control the effect of the classification and clustering of word meanings [25].
The process of the SVD algorithm cannot be controlled because its time and space complexity are too large. Thus, it is difficult to deal with large-scale data and real applications under current computational capabilities [26]. The updated matrix after the SVD algorithm has both positive and negative values, which means that the similarity values between words may be negative [25]. So, it is hard to give the matrix a definite physical meaning. In addition, the uncertainty of the similarity values brings some difficulties to further applications. The effect on synonyms and near-synonyms is not obvious although the dimension reduction effect is significant, because the process of dimension reduction is inflexible without prior information [29]. Therefore, the final performance of text classification is often damaged greatly [100]. It is difficult to determine the value of K in the SVD algorithm. Generally, the value of K is determined by empirical equations, comparing possible choices one by one [25] [26].

2.2.4 PLSI Model

2.2.4.1 Basic Concepts

The second major breakthrough in probability models was the Probabilistic Latent Semantic Indexing (PLSI) model, which was presented by Hofmann in 1999 [25] [26]. PLSI is used to simulate the word generation process, extending LSI to the framework of probability statistics. It redesigns the idea of generating models: it abandons the matrix transformation method of LSI and makes use of a generative model. Compared with a non-generative model, a generative model describes the reasons for generating some probability density functions and the process of the interaction between the factors, while a non-generative model ignores the generative process and directly presents the final result of the superposition of various factors.
For a complex construction process, a non-generative model often needs certain idealized processing to obtain the result by mathematical derivation, while a generative model tends to use approximate calculation methods to approach the exact solution. Considering computing performance, a non-generative model can sometimes even express an exact solution through mathematical equations, but the computing time and complexity are unacceptable because the computational equations often involve operations such as summation, products and continuous integration; therefore, the models cannot be represented by classical algebra. PLSI is a probability model. Its main idea is to construct a semantic space of low dimension. All the words and documents are treated equally and mapped into this semantic space. In this way, it not only solves the problem of high dimensionality, but also reflects the relationships between the words. For example, if the semantics of two words are closer, the points corresponding to the words in the semantic space are also closer. In constructing the mapping, the PLSI model uses the iterative Expectation Maximization (EM) algorithm, making it more efficient [27] [28].

2.2.4.2 The Modeling Process

First of all, based on the probability principle, the relationship between the document d and the word w is established, which is the probability similarity between them and can be expressed as P(d, w). Figure 2.4 is a schematic diagram of the PLSI model. The boxes are plates that represent replicates. The outer plate represents documents, and the inner plate represents the repeated choice of latent topics and words within a document. In addition, the filled circles represent the observed variables d and w, the hollow circle represents the latent variable z, and the black arrow represents the conditional dependency between two variables, namely the
conditional probability. PLSI not only uses the idea of the generative model, but also reflects the internal relations between d and w. Besides, in order to compress the data, the latent multi-dimensional variable z is brought in, which can be understood as the semantic space or the topic space. The dimension of the space is a constant which is set manually. PLSI assumes that d and w are conditionally independent given z, and that the distribution of z given d or w is conditionally independent.

Figure 2.4: The graphical model representation of PLSI.

Secondly, according to the above assumptions and the conditional probability equation, the following equation can be obtained:

P(d, w) = Σ_{z∈Z} P(z) P(w|z) P(d|z)    (2.7)

where P(d, w) represents the probability similarity between d and w, P(z) means the latent semantic prior probability, P(w|z) stands for the conditional probability of w given z, and P(d|z) is the conditional probability of d given z. In order to obtain the above parameters by unsupervised training, it is necessary to use the EM algorithm to fit the latent semantic model [25] [28]. Different from SVD, the decision function in PLSI is the more statistically meaningful minimum
entropy principle, which is to maximize the following log-likelihood function:

L = Σ_{d∈D} Σ_{w∈W} f(d, w) log P(d, w)    (2.8)

where f(d, w) is the initial value, namely the frequency of w appearing in d. Thirdly, the following standard EM algorithm is used for the optimization [25] [27] [28].

E-step:

P(z|d, w) = P(z) P(d|z) P(w|z) / Σ_{z'∈Z} P(z') P(d|z') P(w|z')    (2.9)

M-step:

P(w|z) = Σ_d f(d, w) P(z|d, w) / Σ_{d, w'} f(d, w') P(z|d, w')    (2.10)

P(d|z) = Σ_w f(d, w) P(z|d, w) / Σ_{d', w} f(d', w) P(z|d', w)    (2.11)

P(z) = (1/R) Σ_{d, w} f(d, w) P(z|d, w)    (2.12)

R = Σ_{d, w} f(d, w)    (2.13)

After initialization with random numbers, the E-step and M-step are carried out alternately in an iterative calculation. After many iterations, the probability similarity matrix P(d, w) between d and w can be gained. Furthermore, the probability similarity matrix P(w, w) between the words and P(d, d) between the documents can also be calculated, which can be applied to information retrieval, text classification, etc. For example, the output matrix P(d, w) is multiplied by its own transposition, obtaining the probability similarity matrix P(w, w):

P(w_1, w_2) = Σ_{d∈D} P(d, w_1) P(d, w_2)    (2.14)

The core issue of text retrieval is to calculate the similarity between keywords and
documents. The queried keywords are constructed into a word vector w_q, whose dimension is equal to the row dimension of the matrix P(w, w). The similarity between the queried keywords and the document d_n can be calculated as follows:

Similarity(w_q, d_n) = w_q P(w, w) w_n^T    (2.15)

In addition, the key issue of text classification is to calculate the similarity between documents. Assume that the word vector w_o is extracted from the document d_o, and its dimension is equal to the row dimension of the matrix P(w, w). Here, the elements of w_o are the normalized frequencies of the words appearing in the document. Similarly, the word vector w_n of the document d_n can be gained. So, the similarity between d_o and d_n can be computed as below:

Similarity(d_o, d_n) = w_o P(w, w) w_n^T    (2.16)

2.2.4.3 The Advantages of PLSI

The latent semantic space z of PLSI has a clear physical meaning, representing the latent topics. In addition, the other probability values also have corresponding physical meanings [25] [26]. PLSI can solve the problem of synonyms and polysemes effectively, and uses the Expectation Maximization (EM) algorithm to train the latent classes [27]. Compared with LSI, it has a strong statistical basis. PLSI uses the EM algorithm to obtain solutions by iteration while computing the model, reducing time complexity greatly and raising computing speed [28]. Thus, it is easier to implement than the SVD algorithm.
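The E-step and M-step of Section 2.2.4.2 can be sketched compactly with NumPy arrays. This is an illustrative toy implementation under the standard PLSI update equations, not the thesis's code; the function name, iteration count and toy count matrix are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

def plsi_em(F, n_topics, n_iter=50):
    """EM for PLSI on a document-word count matrix F.
    Implements the updates of Equations (2.9)-(2.13); returns
    P(z), P(d|z), P(w|z)."""
    D, W = F.shape
    Pz = np.full(n_topics, 1.0 / n_topics)
    Pd_z = rng.random((n_topics, D)); Pd_z /= Pd_z.sum(axis=1, keepdims=True)
    Pw_z = rng.random((n_topics, W)); Pw_z /= Pw_z.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step (2.9): P(z|d,w) proportional to P(z) P(d|z) P(w|z)
        Pz_dw = Pz[:, None, None] * Pd_z[:, :, None] * Pw_z[:, None, :]
        Pz_dw /= Pz_dw.sum(axis=0, keepdims=True) + 1e-12
        # M-step (2.10)-(2.13): re-estimate from expected counts f(d,w) P(z|d,w)
        C = F[None, :, :] * Pz_dw                      # shape (z, d, w)
        Pw_z = C.sum(axis=1); Pw_z /= Pw_z.sum(axis=1, keepdims=True)
        Pd_z = C.sum(axis=2); Pd_z /= Pd_z.sum(axis=1, keepdims=True)
        Pz = C.sum(axis=(1, 2)) / F.sum()
    return Pz, Pd_z, Pw_z

F = np.array([[4, 3, 0, 0],      # documents 0-1 use words 0-1
              [3, 4, 0, 0],
              [0, 0, 4, 3],      # documents 2-3 use words 2-3
              [0, 0, 3, 4]], dtype=float)
Pz, Pd_z, Pw_z = plsi_em(F, n_topics=2)
assert abs(Pz.sum() - 1.0) < 1e-6
```

On this block-structured toy corpus, the two recovered topics tend to separate the two word groups, mirroring the latent-class interpretation described above.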
2.2.4.4 The Disadvantages of PLSI

The EM algorithm of PLSI is completely unsupervised, so its convergence is slow as the algorithm iterates [29]. After getting the matrix P(d, w) by the EM algorithm of PLSI, if there is a new batch of training data and an updated matrix P(d, w) is required, the only method is to iterate the algorithm from the beginning. Obviously, the efficiency is poor [39]. The parameter space of PLSI is proportional to the training data, so it is not good for modeling large-scale or dynamically growing corpora [40]. The PLSI model needs a prior probability, which is only based on the existing training set. For text outside the training set, there is no suitable prior probability [29]. As the number of training samples increases, the size of the matrix increases linearly. Thus, PLSI has the problem of overfitting: there are too many discrete characteristics in the PLSI model, and these discrete characteristics are only suitable for the training text set but cannot describe text outside the training set [39] [40].

2.2.5 LDA Model

2.2.5.1 Basic Concepts

Aiming to address the above problems in PLSI, Blei et al. put forward the Latent Dirichlet Allocation (LDA) model in 2003 [17] [30] [32]. Based on the PLSI model, LDA uses a K-dimensional latent random variable which obeys the Dirichlet distribution to represent the topic mixture ratio of the document, which simulates the generation process of the document. At present, the LDA model is one of the most popular probability topic models, and it has more comprehensive assumptions about text generation than other models. As shown in Figure 2.5, in the process of text
generation, LDA samples from the Dirichlet distribution to generate a text with a specific topic multinomial distribution, where the text is usually composed of some latent topics. These topics are then sampled repeatedly to generate each word of the document. Thus, the latent topics can be seen as probability distributions over the words in the LDA model, and each document is expressed as a random mixture of these latent topics according to a specific proportion.

Figure 2.5: The network topology of LDA latent topics (documents are linked to topics, and topics to words).

In addition, Girolami et al. pointed out in 2003 that the PLSI model can be seen as a special case of the LDA model when the Dirichlet distribution degenerates to a uniform one [39]. The LDA model inherits all the advantages of the PLSI model without adding any idealized condition, describing a semantic model more factually. For example, in the selection of topics, PLSI needs to determine a text class label. According to the label, the topic distribution is generated, where the probabilities of the different topics are independent of each other. But in LDA, the Dirichlet distribution is used to generate a topic set. The topic set is not a series of independent parameters as in PLSI; it is represented by latent random variables. So, LDA matches real semantic conditions better than other models, and it has a stronger descriptive power. Compared with the PLSI model, the parameter space of LDA is
simple. Besides, the size of the parameter space has nothing to do with the number of training documents in LDA, so there is no overfitting. Therefore, the LDA model is a complete probability generative model [39] [40].

2.2.5.2 The Modeling Process

It is necessary to briefly explain how the LDA model generates a document. For example, suppose that a corpus contains three topics: sports, technology and movies. A document which describes the film production process may contain both the topic technology and the topic movies. The topic technology has a series of words that are related to technology, and each of these words has a probability, namely the probability of the word appearing in a document with the topic technology. Similarly, there are a series of words related to movies in the topic movies, with corresponding probabilities. When generating a document about film production, topics are chosen randomly first; the probability of choosing the two topics technology and movies is higher. Then a word is selected, and the probability of choosing words related to those two topics is higher. So, the selection of one word is completed. After selecting N words in this way, a document is created. The LDA model is a probability generative model which models discrete data sets; it is a three-tier Bayesian model and a method of modeling the topic information of documents [30] [32]. It describes documents briefly while keeping the essential statistical information, helping to process large-scale text sets efficiently. As shown in Figure 2.6 (Left), the text set, like the top large circle, can be divided into several latent topics (t_1, t_2, …, t_n), like the bottom small circles, and the topological structure of these latent topics is linear. Furthermore, probability inference algorithms can express a single document as a mixture of these latent topics in specific proportions.
Figure 2.6 (Right) is the graphical model representation of LDA. The boxes are plates that represent replicates. The outer plate represents documents, which means that the topic distribution is sampled repeatedly from the Dirichlet distribution for each document in the document collection. The inner plate represents the repeated choice of topics and words within a document, which means that the words in a document are generated by repeated sampling from the topic distribution. In addition, the hollow circles represent the latent variables α, β, θ and z, the filled circle represents the observed variable w, and each black arrow represents the conditional dependency between two variables, namely a conditional probability. Furthermore, Figure 2.6 (Right) shows that the LDA model is a typical directed probability graph model with a clear hierarchical structure: the document-collection tier, the document tier and the word tier. The LDA model is determined by the parameters (α, β) of the document-collection tier, where α reflects the relationship between the latent topics in the document collection and β reflects the probability distributions of all the latent topics themselves. α and β are assumed to be sampled once in the process of generating a document collection. The random variable θ is the document-tier parameter, and its components stand for the proportions of the various latent topics in the target document; θ is sampled once per document. In addition, z and w are the word-tier parameters, where z represents the latent topic the target document assigns to each word and w is the word-vector representation of the target document; z and w are sampled once for each word in each document. In brief, for a document, its topic distribution θ is drawn from a Dirichlet prior with parameter α. For a word w in a document, its topic z is generated from the topic distribution θ, and the word w is generated according to the parameter β.
Figure 2.6: (Left) The structure diagram of LDA latent topics. (Right) The graphical model representation of LDA.

Therefore, the process by which the LDA probability topic model generates a document is described as follows [30]:

Step 1: For each topic j, according to the Dirichlet distribution Dir(β), a word multinomial distribution vector φ^(j) for topic j is obtained.

Step 2: According to the Poisson distribution Poisson(ξ), the number of words N in the document is obtained.

Step 3: According to the Dirichlet distribution Dir(α), a topic probability distribution vector θ of the document is obtained, where θ is a column vector and α is a parameter of the Dirichlet distribution.

Step 4: For each word w_n of the N words in the document: a topic k is chosen randomly from the multinomial distribution Multinomial(θ), and a word is selected from the multinomial conditional probability distribution
Multinomial(φ^(k)) of the topic k, which is taken as w_n.

If there are T latent topics, the probability of the i-th word w_i appearing in the document can be expressed as follows:

P(w_i) = \sum_{j=1}^{T} P(w_i \mid z_i = j)\, P(z_i = j)   (2.17)

where w_i is the i-th feature word in the document, j is the j-th latent topic and z_i = j means that w_i takes the j-th latent topic. P(w_i | z_i = j) is the probability of the word w_i belonging to the topic j, and P(z_i = j) is the probability of the topic j belonging to the document. Let the j-th topic be expressed as a multinomial distribution over words, \varphi^{(j)}_w = P(w \mid z = j), and let the document be expressed as a random mixture of the T latent topics, \theta^{(d)}_j = P(z = j). So, the probability of the word w_i appearing in the document can be written as below:

P(w_i) = \sum_{j=1}^{T} \varphi^{(j)}_{w_i} \, \theta^{(d)}_j   (2.18)

In order for the model to deal easily with new documents outside the training corpus, a Dirichlet(α) prior on θ^(d) is assumed in the LDA model [30]. Similarly, a Dirichlet(β) prior on φ^(z) is supposed so as to infer the model parameters, as follows:

w_i \mid z_i, \varphi^{(z_i)} \sim \mathrm{Discrete}(\varphi^{(z_i)}), \qquad \varphi^{(z)} \sim \mathrm{Dirichlet}(\beta)   (2.19)

z_i \mid \theta^{(d_i)} \sim \mathrm{Discrete}(\theta^{(d_i)}), \qquad \theta^{(d)} \sim \mathrm{Dirichlet}(\alpha)   (2.20)

where the extent to which topics and words are used is determined by the specific values of α and β, and different words and topics are basically used in the same way. Thus, it
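The four generative steps above can be sketched in a few lines of Python. This is an illustrative sketch only, not an implementation used in the thesis: the topic vocabularies and hyper-parameter values are invented, a Dirichlet draw is built from stdlib Gamma samples, and Knuth's method stands in for the Poisson draw of Step 2.

```python
import math
import random

def sample_dirichlet(alpha):
    """Draw from Dirichlet(alpha) via normalized Gamma samples."""
    g = [random.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

def sample_poisson(lam):
    """Knuth's Poisson sampler (adequate for small lambda)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while p > L:
        k += 1
        p *= random.random()
    return k - 1

def sample_discrete(probs, items):
    """Draw one item according to the given probabilities."""
    r, acc = random.random(), 0.0
    for p, item in zip(probs, items):
        acc += p
        if r <= acc:
            return item
    return items[-1]

def generate_document(vocab_by_topic, alpha=0.5, beta=0.1, mean_len=8):
    T = len(vocab_by_topic)
    # Step 1: a word distribution phi^(j) for each topic j
    phis = [sample_dirichlet([beta] * len(v)) for v in vocab_by_topic]
    # Step 2: document length N ~ Poisson(xi)
    N = sample_poisson(mean_len)
    # Step 3: topic proportions theta ~ Dir(alpha)
    theta = sample_dirichlet([alpha] * T)
    doc = []
    for _ in range(N):
        # Step 4: z ~ Multinomial(theta), then w ~ Multinomial(phi^(z))
        k = sample_discrete(theta, list(range(T)))
        doc.append(sample_discrete(phis[k], vocab_by_topic[k]))
    return doc
```

For example, `generate_document([["goal", "match"], ["camera", "render"], ["actor", "scene"]])` returns a bag of words mixing the (hypothetical) topics in proportions drawn from the Dirichlet prior.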
can be assumed that all the values of α are the same and all the values of β are also the same, which is called the symmetrical Dirichlet distribution [32].

The LDA model uses the Dirichlet distribution as the conjugate prior of the multinomial distribution, simplifying the statistical derivation of the model [30]. For a T-dimensional multinomial distribution p = (p_1, \ldots, p_T), the corresponding Dirichlet probability density is as follows:

\mathrm{Dir}(\alpha_1, \ldots, \alpha_T) = \frac{\Gamma\!\left(\sum_{j=1}^{T} \alpha_j\right)}{\prod_{j=1}^{T} \Gamma(\alpha_j)} \prod_{j=1}^{T} p_j^{\alpha_j - 1}   (2.21)

where the parameters α_1, …, α_T are called the hyper-parameters of the multinomial distribution p = (p_1, …, p_T). α_j is the prior observation count of the frequency of topic j appearing in the document, which means that topic j has appeared α_j times before any real word in the document is sampled. In order to simplify the model, LDA uses the symmetrical Dirichlet distribution, namely α_1 = α_2 = … = α_T = α. Obviously, the Dirichlet prior distribution can smooth the multinomial distribution p = (p_1, …, p_T) through the hyper-parameter α, preventing overfitting to the data set. The greater the value of α, the more the multinomial distribution converges towards the center of the simplex. Therefore, by choosing an appropriate α, the problem of overfitting can be avoided [17]. However, if the value of α is too large, the obtained distribution will not reflect the real distribution of the data set.

As shown in Figure 2.6 (Right), the hyper-parameters α and β control the topic distribution and the word distributions of the topics through the Dirichlet distributions Dir(α) and Dir(β) [30]. Then, θ and β codetermine
each word in the document. For a document, the joint distribution of all the known and latent variables is as follows:

p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)   (2.22)

After marginalizing out the variables θ and z, the probability distribution of w is obtained as below:

p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta   (2.23)

Therefore, for the whole text set D, the corresponding probability distribution is as follows:

p(D \mid \alpha, \beta) = \prod_{d=1}^{|D|} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d   (2.24)

2.2.5.3 The Advantages of LDA

The LDA model is a total probability generative model, so it has a clear internal structure and can use efficient probability inference algorithms to calculate the model parameters [30] [39]. The size of the LDA model parameter space has nothing to do with the number of training documents; therefore, it is more suitable for handling large-scale corpora [29]. The LDA model introduces the hyper-parameter α at the document-topic tier, which is better than PLSI [40]. Prior information is added, which means that the parameters can be seen as random variables. Besides, this makes LDA a hierarchical model with a more stable structure, avoiding overfitting [39].
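The smoothing role of the symmetric hyper-parameter α in Equation (2.21) can be illustrated numerically. The following is a minimal sketch under invented settings (T = 5 topics, 2000 trials): Dirichlet draws are built from stdlib Gamma samples, and the average largest component of a draw serves as a rough measure of how peaked the sampled multinomials are.

```python
import random

def sample_symmetric_dirichlet(alpha, T):
    """Draw from a symmetric Dirichlet(alpha, ..., alpha) over T topics
    using the standard Gamma construction."""
    g = [random.gammavariate(alpha, 1.0) for _ in range(T)]
    s = sum(g)
    return [x / s for x in g]

def average_max_component(alpha, T=5, trials=2000):
    """Mean of the largest topic proportion across many draws:
    close to 1 for peaked (sparse) draws, close to 1/T for draws
    concentrated at the center of the simplex."""
    return sum(max(sample_symmetric_dirichlet(alpha, T))
               for _ in range(trials)) / trials
```

With a small α (e.g. 0.1) the average largest component is close to 1 (peaked draws), while with a large α (e.g. 50) it approaches 1/T, matching the statement above that a greater α pulls the multinomial towards the center of the simplex.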
2.3 An Overview of the Main Inference Algorithms for LDA Model Parameters

According to the assumptions made by Blei et al. [30], when LDA models the documents, the topic distribution parameter α and the Dirichlet distribution parameter β are fixed in the generative process of text sets. So, for a document, the posterior probability distribution of its latent variables is as follows:

p(\theta, z \mid w, \alpha, \beta) = \frac{p(\theta, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)}   (2.25)

In the above equation, the normalization term p(w | α, β) can be calculated by Equation (2.23). After bringing in Dir(α) and Dir(β), a new equation can be obtained:

p(w \mid \alpha, \beta) = \frac{\Gamma\!\left(\sum_{i} \alpha_i\right)}{\prod_{i} \Gamma(\alpha_i)} \int \left( \prod_{i=1}^{T} \theta_i^{\alpha_i - 1} \right) \left( \prod_{n=1}^{N} \sum_{i=1}^{T} \prod_{j=1}^{V} (\theta_i \beta_{ij})^{w_n^j} \right) d\theta   (2.26)

It can be seen that, for a document, the posterior probability distribution of its latent variables θ and z cannot be obtained by direct calculation. Usually, three main inference methods are used to estimate approximate values and infer the model parameters: Variational Inference (VI) [31] [32], Belief Propagation (BP) [106] and Gibbs Sampling (GS) [107] [108]. Finally, the latent topic structure of the data and the important information of the documents are obtained.

2.3.1 Variational Inference (VI)

Blei et al. [30] defined a distribution q(θ, z | γ, φ) and brought in the variational parameters γ and φ. Then, the variational parameters are used to rewrite the posterior probability distribution of the latent variables θ and z, to approximate the
real posterior probability distribution. The expression of the rewritten posterior distribution is as follows:

q(\theta, z \mid \gamma, \varphi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \varphi_n)   (2.27)

where γ and φ are the parameters of q. In addition, VI converts the original complex graphical model of Figure 2.6 (Right) into the simple graphical model shown in Figure 2.7. Figure 2.7 is the graphical model representation of the variational distribution used to approximate the posterior in LDA. The boxes are plates that represent replicates. The outer plate represents documents, and the inner plate represents the repeated choice of topics within a document. In addition, the hollow circles represent the variables, where γ and φ are the variational parameters and θ and z are the latent variables, and each black arrow stands for the conditional dependency between two variables, namely a conditional probability. It is assumed that θ and the z_n are independent of each other, and the word variable w is removed [32].

Figure 2.7: The LDA probabilistic graphical model with the variational parameters.

The whole approximation process can be seen as the process of finding the
suitable variational parameters γ and φ that minimize the Kullback-Leibler Divergence (KLD) [109] between the estimated posterior distribution and the real posterior distribution, which can be expressed as the following equation:

(\gamma^*, \varphi^*) = \arg\min_{(\gamma, \varphi)} D\big(q(\theta, z \mid \gamma, \varphi) \,\|\, p(\theta, z \mid w, \alpha, \beta)\big)   (2.28)

where the KLD between two discrete distributions P and Q is defined as below [110]:

D(P \,\|\, Q) = \sum_{i} P(i) \ln \frac{P(i)}{Q(i)}   (2.29)

The parameter estimation process uses Variational EM to maximize the lower bound of the log-likelihood. The specific process can be divided into the following two steps:

E-step: The variational parameters minimizing the KLD can be obtained by the fixed-point iteration method [105]. The update equations are as follows:

\varphi_{ni} \propto \beta_{i w_n} \exp\{ E_q[\log(\theta_i) \mid \gamma] \}   (2.30)

\gamma_i = \alpha_i + \sum_{n=1}^{N} \varphi_{ni}   (2.31)

In Equation (2.30), E_q[\log(\theta_i) \mid \gamma] = \Psi(\gamma_i) - \Psi\big(\sum_{j=1}^{T} \gamma_j\big), where Ψ stands for the first derivative of the log Γ function; it can be obtained by Taylor approximation [111]. For each document, the optimal variational parameters (γ*, φ*) can be found according to Equations (2.30) and (2.31).

M-step: In order to estimate the parameters α and β of the LDA model, the parameters which maximize the log-likelihood function of the document set need to
be found [28]. For a document set D, the log-likelihood function is defined as follows:

\ell(\alpha, \beta) = \sum_{d=1}^{|D|} \log p(w_d \mid \alpha, \beta)   (2.32)

It is difficult to compute p(w_d | α, β) directly, so the EM algorithm is used to find the parameters which maximize the lower bound of the function in Equation (2.32) instead [111]. The parameter β can be obtained by the Lagrange multiplier method [112], which can be expressed as below:

\beta_{ij} \propto \sum_{d=1}^{|D|} \sum_{n=1}^{N_d} \varphi^*_{dni} \, w^j_{dn}   (2.33)

where β_{ij} expresses the probability of the topic z = i generating the word w = j, |D| represents the number of documents and N_d the number of words in document d. φ*_{dni} is the i-th component of the optimized variational parameter for the word w_{dn}; it is a word-level variable computed once for each word in each document, and w^j_{dn} indicates that the value of w_{dn} is j. In addition, the parameter α can be obtained by the Newton-Raphson method [113]. In brief, according to the approximate values obtained in the E-step, the parameters α and β are obtained in the VI algorithm by maximizing the lower bound of the objective function.
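The E-step fixed-point iteration of Equations (2.30) and (2.31) can be sketched as follows. This is a minimal illustration rather than the thesis implementation: the topic-word matrix `beta` is assumed fixed and indexed as `beta[i][w]`, the initialization follows Blei et al.'s suggestion, and the digamma function Ψ is computed with the standard recurrence plus asymptotic series instead of a library call.

```python
import math

def digamma(x):
    """Psi function (derivative of log Gamma) via the recurrence
    psi(x) = psi(x + 1) - 1/x plus the standard asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    result += math.log(x) - 0.5 * inv
    inv2 = inv * inv
    result -= inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result

def e_step(doc, beta, alpha, n_iter=50):
    """Fixed-point iteration of Eqs (2.30)-(2.31) for one document.
    doc: list of word ids; beta[i][w]: word probability under topic i;
    alpha: Dirichlet parameters, one per topic."""
    T, N = len(alpha), len(doc)
    phi = [[1.0 / T] * T for _ in range(N)]        # uniform start
    gamma = [alpha[i] + N / T for i in range(T)]   # standard init
    for _ in range(n_iter):
        psi_sum = digamma(sum(gamma))
        for n, w in enumerate(doc):
            # Eq (2.30): phi_ni ∝ beta_{i,w_n} exp(E_q[log theta_i|gamma])
            row = [beta[i][w] * math.exp(digamma(gamma[i]) - psi_sum)
                   for i in range(T)]
            s = sum(row)
            phi[n] = [r / s for r in row]
        # Eq (2.31): gamma_i = alpha_i + sum_n phi_ni
        gamma = [alpha[i] + sum(phi[n][i] for n in range(N))
                 for i in range(T)]
    return gamma, phi
```

On a toy two-topic problem where one topic strongly favors the observed words, the returned γ correctly concentrates mass on that topic, and Σ_i γ_i = Σ_i α_i + N holds by construction of Equation (2.31).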
2.3.2 Belief Propagation (BP)

Belief Propagation (BP) is an algorithm which uses message passing to infer graphical model parameters. It is an effective method for computing conditional marginal probabilities. BP is applied to graphical models containing variables and factors, localizing and distributing the variables. In other words, it distributes the computation of a global integral over the message passing of the individual nodes, and each node interacts with its adjacent nodes. With the BP algorithm, the amount of computation can be reduced from exponential growth to approximately linear growth, making parameter inference methods applicable to complex models. In 2011, Jia Zeng first applied the BP algorithm to parameter inference for the LDA model [106].

Figure 2.8 shows a factor graph based on the LDA model. The model parameters θ_d and φ_w are called factors, which are represented by squares. The other variables, which connect two factors, are called topic labels and are represented by circles [114]. Thus, the factor θ_d connects the topic labels of all the words of a document, and the factor φ_w connects the topic labels of a word across all the documents. θ_d connects the two latent variables z_{w,d} and z_{−w,d}, where −w stands for the collection of all the words except w in the document d being assigned topic labels. So, θ_d connects the topic labels of different words in the same document. In addition, φ_w connects the two topic labels z_{w,d} and z_{w,−d}, where −d means the collection of the topic labels of the same word in all the documents except the document d. So, φ_w connects the collection of the topic labels of the same word across all the documents. The observed variables w and d of the classic LDA graphical model are hidden in the factors θ_d and φ_w.
Figure 2.8: Belief propagation in the LDA model based on the factor graph.

As shown in Figure 2.8, the hollow circles represent the latent variables, and the squares represent the two factors θ_d and φ_w. In addition, the black arrows stand for the direction of message passing. When the BP algorithm is used to estimate the LDA parameters, the topic label information of a word in a document is determined by the information of the connected factors θ_d and φ_w. In other words, it is determined by the information of the different words in the same document and the information of the same word in different documents in the previous iteration [106]. The main steps of the BP algorithm are described as follows:

Step 1: The parameters θ, φ, T, K and μ_{w,d}(k) are initialized, where θ_d and φ_w are the latent parameters in LDA, T represents the number of iterations until convergence, K represents the number of topics, and μ_{w,d}(k) represents the message content.

Step 2: For all the documents in a text set, K message values are calculated for each word, and they are saved to update μ_{w,d}(k).
Step 3: The updated μ_{w,d}(k) is used to update the model parameters θ_d and φ_w. At this moment, t = t + 1, where t represents the current iteration number and its initial value is zero.

Step 4: If t < T, repeat Step 2 and Step 3. When t = T, the BP algorithm is considered converged and ends.

Step 5: The final results of θ_d and φ_w are output.

The main message-update equation is as follows:

\mu^{t+1}_{w,d}(k) \propto \frac{\mu^{t}_{-w,d}(k) + \alpha}{\sum_{k}\big[\mu^{t}_{-w,d}(k) + \alpha\big]} \times \frac{\mu^{t}_{w,-d}(k) + \beta}{\sum_{w}\big[\mu^{t}_{w,-d}(k) + \beta\big]}   (2.34)

In turn, for the messages generated at each step, the messages in the BP algorithm are also transmitted from variables to factors [106]. According to Equations (2.35) and (2.36), the messages can be transmitted back to the factors θ_d and φ_w; therefore, it is a mutual process. After many iterations, the algorithm finally converges, yielding the parameters θ_d and φ_w of the LDA model as below:

\theta_d(k) = \frac{\mu_{\cdot,d}(k) + \alpha}{\sum_{k}\big[\mu_{\cdot,d}(k) + \alpha\big]}   (2.35)

\varphi_w(k) = \frac{\mu_{w,\cdot}(k) + \beta}{\sum_{w}\big[\mu_{w,\cdot}(k) + \beta\big]}   (2.36)

2.3.3 Gibbs Sampling (GS)

Griffiths et al. [82] proposed Gibbs sampling, a kind of Markov Chain Monte Carlo (MCMC) method, which is easy to implement and can extract topics from large-scale corpora very effectively. Besides, its perplexity and operating speed
are better than other algorithms'. Thus, the Gibbs sampling algorithm has become the most popular inference algorithm for the LDA model [111]. Gibbs sampling is a simple realization of MCMC [115] [116]. Its basic idea is to construct a Markov chain which converges to a target probability distribution, and to extract samples which are considered close to that probability distribution by sampling from the posterior probability distribution along the chain. The key to using Gibbs sampling to infer the LDA model parameters is that the model parameters are not obtained directly; instead, they are obtained indirectly by sampling the topic of each word [107]. In other words, the values of the model parameters θ and φ can be obtained indirectly by calculating the posterior probability. The details of the Gibbs sampling algorithm will be introduced in Section 3.3.1 in Chapter 3.

2.3.4 Analysis and Discussion

The model obtained by variational inference differs somewhat from the real situation. Moreover, the EM algorithm often cannot find the optimal solution because of the local maxima of the likelihood function. Thus, its accuracy is worse than that of GS and BP. Besides, the VI algorithm brings in the complex gamma function, making iteration and parameter updating need more time and reducing the computational efficiency greatly. The BP algorithm uses the idea of message passing to solve the LDA modeling problem and keeps all the information in the process of message passing. Its accuracy is very high, but it has some problems such as taking too much memory and low efficiency. In addition, studies showed that when processing large-scale real data, the model convergence of GS was faster than variational EM and BP [72], and the scalability of MCMC was better than variational EM and BP. Furthermore, many parallel LDA algorithms make use of GS [117]. Therefore, the Gibbs sampling algorithm is used in this thesis to solve the parameter inference problem of the LDA model.
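Although the details are deferred to Section 3.3.1, the collapsed Gibbs update of Griffiths et al. can be sketched now. This is a minimal illustration, not the thesis implementation: the count-matrix names (`ndk`, `nkw`, `nk`) and the toy settings are invented, and each word's topic is resampled from the standard full conditional, which is proportional to (n_{kw} + β)/(n_k + Vβ) · (n_{dk} + α).

```python
import random

def gibbs_lda(docs, V, K, alpha=0.5, beta=0.1, n_sweeps=200, seed=0):
    """Collapsed Gibbs sampling for LDA: resample every word's topic
    from its full conditional, then read theta and phi off the counts."""
    rng = random.Random(seed)
    z = [[rng.randrange(K) for _ in doc] for doc in docs]
    ndk = [[0] * K for _ in docs]      # document-topic counts
    nkw = [[0] * V for _ in range(K)]  # topic-word counts
    nk = [0] * K                       # per-topic totals
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(n_sweeps):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]            # remove the current assignment
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # full conditional for z = k (unnormalized)
                p = [(nkw[k][w] + beta) / (nk[k] + V * beta)
                     * (ndk[d][k] + alpha) for k in range(K)]
                r = rng.random() * sum(p)
                k, acc = 0, p[0]
                while acc < r:
                    k += 1; acc += p[k]
                z[d][n] = k            # record the new assignment
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    theta = [[(ndk[d][k] + alpha) / (len(docs[d]) + K * alpha)
              for k in range(K)] for d in range(len(docs))]
    phi = [[(nkw[k][w] + beta) / (nk[k] + V * beta)
            for w in range(V)] for k in range(K)]
    return theta, phi
```

The smoothed estimates θ and φ at the end are exactly the posterior-mean readouts from the count matrices, so each row of θ and each row of φ sums to one by construction.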
2.4 An Overview of Genetic Algorithms

2.4.1 The Basic Idea of Genetic Algorithms

The Genetic Algorithm (GA) is a product which combines life sciences and engineering science; it is a computational model of the biological evolution process, simulating genetic selection and natural elimination. It was proposed by Professor Holland at the University of Michigan in 1969, and he published an influential book called Adaptation in Natural and Artificial Systems [118]. After the generalization and summarization by De Jong, Goldberg et al., a new global optimization search algorithm was put forward [119]. In addition, GA was derived from Darwin's evolution theory, Weismann's species selection theory and Mendel's population genetics theory. Holland not only designed the simulation and the operating principle of GA, using statistical decision theory to analyze the search mechanism of GA, but also established the famous schema theorem and the implicit parallelism principle, laying the foundation for the development of GA [120] [121].

The basic idea of GA is to simulate breeding, crossing and mutation in the process of natural selection and natural genetics [118]. When dealing with a practical problem, every possible solution of the problem is coded into a chromosome, and a number of chromosomes constitute the population. Each chromosome gets a numerical evaluation from the fitness function; individuals with low fitness are eliminated and individuals with high fitness are chosen to participate in the genetic operations. After the crossover and mutation operators, these chromosomes generate the next generation of the population. The performance of the new population of chromosomes is better than that of the previous generation, because the new population inherits some excellent features from the previous generation. The optimal solution of the problem is sought generation after generation until the convergence condition is met or the pre-set number of iterations is reached [119] [121]. Therefore, GA can be seen as a kind of iterative algorithm, and as a gradual evolution process of a
population composed of feasible solutions, which has a solid biological foundation [122].

2.4.2 The Main Steps of Genetic Algorithms

Since Holland systematically put forward the complete structure and theory of GA, many scholars have proposed various kinds of improved genetic algorithms. The genetic algorithm proposed by Holland is usually called the Simple Genetic Algorithm (SGA) [118] [122]. SGA mainly includes the coding mechanism, the fitness function, selection, crossover and mutation. In practical applications, the design of a GA is considered in terms of these five basic factors. The specific details are as follows:

2.4.2.1 Coding Mechanism

The structure of many application problems is very complex, but they can be reduced to a simple coded representation in bit-string form. The process of converting the problem structure into the coded representation of bit-string form is called coding. It is the basis of GA and greatly influences the performance of the algorithm. The coded representations in bit-string form are called chromosomes or individuals. The collection of strings constitutes the population, and the individuals are strings. For example, in an optimization problem, a string corresponds to a possible solution; in a classification problem, a string can be interpreted as a rule. In SGA, the character set is composed of 0 and 1, and the coding method adopts binary coding, but general GAs are not limited to this. In addition, many scholars have made various improvements to the coding of GA, such as Gray coding, dynamic parameter coding, floating-point coding, decimal coding, symbolic coding, multi-value coding, Delta coding, hybrid coding, DNA coding and so on [122].
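The binary coding mechanism described above can be illustrated with a textbook encode/decode pair that maps a single real-valued parameter onto a fixed-length bit string; the interval bounds and bit width here are invented for the example.

```python
def encode(x, lo, hi, n_bits):
    """Map a real value x in [lo, hi] to an n-bit binary chromosome
    by quantizing the interval into 2^n_bits - 1 equal steps."""
    levels = (1 << n_bits) - 1
    k = round((x - lo) / (hi - lo) * levels)
    return format(k, "0{}b".format(n_bits))

def decode(chrom, lo, hi):
    """Map a binary chromosome back to a real value in [lo, hi]."""
    levels = (1 << len(chrom)) - 1
    return lo + int(chrom, 2) / levels * (hi - lo)
```

With this scheme, crossover and mutation operate on the bit string, while the fitness function evaluates the decoded real value; the bit width controls the resolution of the search (16 bits gives roughly 1/65535 of the interval).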
2.4.2.2 Fitness Function

In order to reflect the adaptive capacity of chromosomes, a function which can measure every chromosome for the problem is brought in, which is called the fitness function. When searching the possible solutions, GA basically does not use external information; the search is guided by the fitness values of the individuals in the population, computed according to the fitness function. The fitness function is the criterion which evaluates each individual in the population, reflecting the principle of survival of the fittest in natural evolution. For optimization problems, the fitness function is usually the objective function [121]. The fitness function should effectively reflect the difference between each chromosome and the chromosome holding the optimal solution of the problem: the smaller this difference, the smaller the difference between their respective fitness values, and vice versa. When using GA to solve specific problems, the choice of the fitness function greatly influences the convergence of the algorithm and the convergence rate. Thus, for different problems, the related parameters are determined as a matter of experience [121].

2.4.2.3 Selection

The selection operation is also called the copy operation; its role is to decide whether individuals are eliminated or inherited in the next generation according to the fitness values of the individuals in the population. In general, selection gives an individual with a bigger fitness value a bigger chance of survival, and an individual with a smaller fitness value a smaller chance of survival. The selection operator can ensure that excellent individuals are transmitted continuously through the genetic operations, making the whole population evolve into an excellent population. SGA makes use of the roulette-wheel selection mechanism. In addition, Potts et al. summarized about 20 other selection methods, such as the random traversal sampling
method, the local selection method, the championship selection method, steady-state copying, optimal string copying, optimal string retention, linear fitness scaling and so on [123].

2.4.2.4 Crossover

The crossover operation replaces and recombines parts of the structure of two parent individuals to generate new individuals. Its purpose is to compose new individuals and to search the string space effectively, while reducing the probability of destroying effective schemata. The crossover operation is an important feature which distinguishes GA from other evolutionary algorithms [122]. The various crossover operators consist of two basic steps [124]. Firstly, in the population formed by the selection operation, individuals are paired at random, and according to a preset crossover probability P_c, each pair of individuals is tested to determine whether it undergoes the crossover operation or not; the bigger the value of P_c, the higher the probability of gene exchange, and vice versa. Secondly, once the crossover point is set, the parts of the structure of the matched pair of individuals before and after the point are interchanged. In SGA, single-point crossover is used. In addition, Potts et al. generalized almost 17 crossover methods, such as two-point crossover, uniform crossover, multi-point crossover, heuristic crossover, sequence crossover, mixed crossover and so on [123].

2.4.2.5 Mutation

The mutation operation changes the values of some genes of individuals in the population randomly, with a very small mutation probability P_m. Its aim is also to generate new individuals, overcoming the problem of premature convergence caused by losing effective genes. It includes determining the location of the change point and replacing the gene values. In general, the importance of the mutation operation ranks only
second to the crossover operation, so its role cannot be ignored [118]. In SGA, for the binary coding, genes in the strings of individuals in the population are selected with probability P_m, and those genes are flipped: if the value is 1, it is changed to 0, and if the value is 0, it is changed to 1. Furthermore, Potts et al. summarized three mutation technologies, including management mutation, varying the mutation probability and single-value operation. Besides, they also summed up a variety of mutation operations, such as basic bit mutation, uniform mutation, Gaussian mutation, non-uniform mutation, adaptive mutation, multi-level mutation and so on [123].

2.4.3 The Parameter Settings of Genetic Algorithms

Generally, the parameter settings of GAs include the population size, the crossover probability and the mutation probability, etc., which influence the accuracy, reliability, computing time and system performance of GAs [118]. At present, the reasonable selection of GA parameter settings lacks relevant theory as guidance, so the parameters are usually set according to experience or experiments [121]. The total number of chromosomes in a population is called the population size. If the population size is too small, the search space of the GA will be limited and it will be difficult to find the optimal solution. On the contrary, if it is too big, the convergence time and the program running time will increase [118]. Thus, different problems may have their own appropriate population sizes. Usually, the population size is set from 20 to 200 [125]. In SGA, the crossover probability P_c and the mutation probability P_m are fixed, determined by experience [122]. When the population size is set from 20 to 200, Schaffer suggested that the optimal value range of P_c is between 0.5 and 0.95 [122]. The bigger the value of P_c, the faster the speed of generating new
individuals will be. However, if it is too large, the probability of damaging good genetic schemata will be greater; if it is too small, the search process will be slower or even stagnant. In addition, the value of P_m is usually set between 0.01 and 0.05 [122]. If the value of P_m is too small, it will be difficult to generate new genetic structures; but if it is too large, the GA will degenerate into a simple random search algorithm.

2.5 An Overview of Hadoop

Hadoop [196] is a distributed open-source computing platform, which was developed by Doug Cutting. By using a simple programming model on a computer cluster, distributed application programs can be written and executed to process large-scale data [60] [61] [62]. It originated in Nutch, a subproject of Apache Lucene, which implemented the MapReduce [66] algorithm and designed its own Hadoop Distributed File System (HDFS) following Google GFS [54]. Hadoop is mainly composed of HDFS and the distributed computing framework MapReduce: the former is the open-source implementation of Google GFS, and the latter is the open-source implementation of Google MapReduce. Furthermore, Hadoop also includes other subprojects, such as Avro, Core, ZooKeeper, HBase, Hive, Pig and Chukwa [60].

Both HDFS and MapReduce in Hadoop use the master/slave architecture [64]. The master node is composed of the Namenode and the Jobtracker; it provides some tools and opens the browser view function for the Hadoop status. Besides, a slave node is made up of a Datanode and a Tasktracker. Usually, the master/slave architecture has one master node and several slave nodes. The master node is a manager or controller, and the slave nodes accept the control of the master node. In brief, the master/slave architecture of Hadoop is mainly reflected in two key technologies: HDFS and MapReduce [68]. The Namenode and Datanodes are responsible for completing the jobs of HDFS, and the Jobtracker and Tasktrackers are responsible for finishing the jobs of MapReduce.
2.5.1 HDFS

HDFS is mainly used to store all the data on the cluster nodes, and it can achieve growth of storage capacity and computing power simply by adding more computers. It is developed in the Java language so that it can run on many low-cost computers, and it has strong fault tolerance. Thus, it provides the underlying support for distributed computing and storage [60] [62]. An HDFS cluster is often made up of one Namenode and a certain number of Datanodes. Usually, each node is an ordinary computer [65]. The Namenode is a central server, which manages the Namespace of the file system and the clients' access to files. A Datanode manages the storage on its local node. The data of users are divided into lots of data blocks which are stored dispersedly on different Datanodes. Datanodes report their listings of blocks to the Namenode periodically. Both the Namenode and the Datanodes are designed to run on normal low-cost computers with Linux [67].

The Namenode is the core of the whole HDFS; it carries out the Namespace operations of the file system, such as opening, closing or renaming a file or directory. It is also responsible for determining how data blocks map to specific Datanodes [63]. A Datanode is responsible for dealing with the read-write requests of the clients of the file system, and for creating, deleting and copying data blocks under the unified scheduling of the Namenode. At given intervals, the Datanodes send the Namenode heartbeat reports; the Namenode controls the Datanodes through these heartbeat reports and continuously tests the health status of the Datanodes. Users only need to use the Namespace provided by HDFS to access and modify files; they do not need to know which Datanode reads or writes the data and which Datanode stores the backup data [65]. When users carry out file operations in HDFS, the effect will be the same as file operations in a stand-alone
system. For example, in a typical cluster, one computer runs only the Namenode, and each of the other computers runs one Datanode [65].

2.5.2 The MapReduce Programming Model in Hadoop

The framework of MapReduce in Hadoop includes one Jobtracker and a certain number of Tasktrackers [62] [68], as shown in Figure 2.9. The Jobtracker often runs on the computer which hosts the Namenode, and each of the other nodes in a cluster runs one Tasktracker. The Jobclient is responsible for submitting the user-defined Mapper class and Reducer class to the system, and for setting various system parameters; it sends MapReduce jobs to the Jobtracker. After accepting a job, the Jobtracker is responsible for the task scheduling of the work nodes, monitoring them periodically and handling failed tasks. A Tasktracker is responsible for reporting progress information to the Jobtracker periodically and executing the specific Map/Reduce operations [64].

Figure 2.9: The typical MapReduce framework in Hadoop.
The programming framework of MapReduce divides a task into two phases, the Map phase and the Reduce phase, and the data structures processed in both phases are (key, value) pairs [52] [68]. Users only need to define their own Map function and Reduce function respectively, and then give them to Hadoop. They do not need to care about complex underlying details, such as task distribution, concurrency control, resource management and fault tolerance.

Map phase [65] [66]: firstly, the framework of MapReduce divides the input data into many data blocks which are independent of each other, so that the Map operations can achieve completely concurrent execution. Then, each block is assigned to a different Map task. Each Map task reads in its assigned blocks in the form (k1, v1). After processing by the Map function, intermediate results are generated. Finally, the intermediate results are stored in local intermediate files. Here, the intermediate results are output in the form (k2, v2).

Reduce phase [65] [66]: the number of Reducers can be designated by users, and it can be one or more than one. In the Reduce process, because its input comes from the intermediate results generated by Map tasks distributed over multiple nodes, the intermediate results need to be copied over the network to the local file systems of the nodes running Reduce tasks. In the output from the Mappers, Hadoop stores all the values with the same key in an iterator and transmits them to the Reducers. After the Reducers receive the input, all the values with the same key are merged. At the end, the merged result is output into the specified output file.

2.5.3 Hadoop Scheduling Algorithms

Job scheduling in Hadoop is the process of the Jobtracker assigning tasks to the corresponding Tasktrackers to run [66]. When multiple tasks are running at the same time,
the sequence of job execution and the allocation of computing resources affect the overall performance of the Hadoop platform and the utilisation rate of system resources. The scheduler is a pluggable module: users can design a scheduler according to their practical application requirements and then specify it in the configuration files; when a Hadoop cluster starts, the scheduler is loaded. At present, Hadoop provides several schedulers, such as the First In First Out (FIFO) scheduler, the fair scheduler and the capacity scheduler [60] [70].

The FIFO scheduler is Hadoop's default scheduler. Its basic idea is to put all MapReduce jobs into a single queue. The Jobtracker then scans the whole job queue in order of priority or submission time, and finally a job that meets the requirements is chosen to run [60]. The fair scheduler was first proposed by Facebook. Its purpose is to let the MapReduce programming model in a Hadoop cluster handle the parallel processing of different types of jobs, and to guarantee that each job obtains an equal share of resources as far as possible. The capacity scheduler was put forward by Yahoo, and its functions are similar to those of fair scheduling, although there are many differences in their design and implementation [52].

FIFO is simple and clear, but it has serious limitations. For example, it ignores the differences in operational requirements: the last job in the scheduling queue has to wait for all previous jobs to complete, which seriously affects the overall performance of the Hadoop platform and the utilisation ratio of system resources. The fair scheduler brings some improvement, but it cannot support large-memory operations. Compared with FIFO, the capacity scheduler is able to optimise resource allocation effectively, enhancing the operating efficiency of jobs. However, many parameter values must be configured manually, which is quite difficult for ordinary users who do not understand the overall situation of the cluster and the system [68] [70]. In addition, [97] pointed out the disadvantages of the existing schedulers in Hadoop.
The Hadoop scheduling mechanism is based on the following assumptions [97]:
1. All nodes run at the same speed.
2. Throughout the whole operational process, tasks run at a constant speed.
3. There is no overhead when starting backup tasks.
4. The task progress can represent the work that has been done.
5. In the Reduce phase, copying, sorting and merging of tasks each take up one third of the whole completion time.

Assumptions 1 and 2 presume a homogeneous environment, so they are not valid in a heterogeneous environment. Hence, the existing schedulers in Hadoop do not consider the differences in processing speed between nodes in a heterogeneous environment. Besides, assumptions 3 and 4 are invalid even in a homogeneous environment; [97] refuted these assumptions one by one. To sum up, the existing schedulers in Hadoop are suitable for a homogeneous environment, but in a heterogeneous environment they waste a lot of resources and incur much overhead, affecting the overall performance of the Hadoop platform.

2.5.4 Divisible Load Theory

Load scheduling, including the division and the transmission of the load, is one of the key factors that influence the performance of parallel and distributed computing [126]. The load scheduling problem can be divided into two categories: non-divisible load scheduling and divisible load scheduling. Non-divisible load jobs cannot be divided in part or in whole; they can only be processed by a single processor as a whole [130]. The basic assumption of divisible load theory is that an application can be divided into subtasks of arbitrary size which are independent of each other and can be executed completely in parallel; such applications are called divisible loads [127] [128] [129].
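Under these assumptions, a natural static allocation gives each processor a share of the load proportional to its speed, so that all processors finish their local computation at (approximately) the same time. The following Python sketch illustrates this speed-proportional split; the function name and the concrete speed values are illustrative only, and communication costs are ignored for simplicity.

```python
def divisible_load_split(total_items, speeds):
    """Assign each node a share proportional to its processing speed, so that
    all nodes would finish their local work at roughly the same time."""
    total_speed = sum(speeds)
    shares = [int(total_items * s / total_speed) for s in speeds]
    shares[0] += total_items - sum(shares)  # give any rounding remainder to node 0
    return shares

# Four heterogeneous nodes: one fast, one medium, two slow.
shares = divisible_load_split(10000, [4.0, 2.0, 1.0, 1.0])
```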
An important feature of divisible load scheduling is that data communication only occurs in two phases: before the start of the computation and at its end. Its aim is to assign a given application to the processors such that the completion time of the whole application is the shortest [131] [132]. In addition, divisible load theory provides a useful framework for divisible load scheduling in a heterogeneous environment: for a given task, if the heterogeneity of the processing nodes and the network is fully considered, divisible load theory provides an efficient task division mode and scheduling mode [133] [134] [135].

In Hadoop, the PLDA algorithm has the following features: first, the input data can be divided into sub-data of any size; second, data communication only occurs in two phases, the data distribution phase and the phase in which results are returned from the processors. It can be seen that the PLDA algorithm in Hadoop conforms completely to the assumptions of divisible load theory. In conclusion, considering that the nodes in a heterogeneous cluster have different computing speeds, communication capabilities and storage capacities, a static load balancing strategy based on a genetic algorithm for PLDA will be proposed in Chapter 5 of this thesis. Because this allocation strategy follows the optimality principle of divisible load theory, all the data nodes should, in theory, complete local Gibbs sampling at the same time, minimising the completion time of parallel LDA and improving the performance of Hadoop clusters in a heterogeneous environment.

2.6 Summary

This chapter first introduced the basic concepts and the modelling processes of LSI, PLSI and LDA, and analysed and compared their advantages and disadvantages. The LDA topic model is a fully probabilistic generative model, which
can use mature probability algorithms to train models directly, and the model is easy to use. The size of its parameter space is fixed and has nothing to do with the scale of the text set, so it is more suitable for large-scale text sets. Besides, LDA is a hierarchical model, which is more stable than non-hierarchical models. After describing and comparing the main inference algorithms for the LDA model parameters, namely variational inference, belief propagation and Gibbs sampling, the research work of this thesis focuses on the LDA model with the Gibbs sampling algorithm. The chapter then introduced the basic concepts, main steps and parameter settings of the genetic algorithm. Next, the related concepts of Hadoop were presented, covering HDFS, the MapReduce programming model and Hadoop scheduling algorithms. At present, the existing Hadoop schedulers are suitable for a homogeneous environment, but in a heterogeneous environment they waste many resources and incur much overhead, which affects the overall performance of the Hadoop platform. Finally, divisible load theory was recommended as the theoretical basis of the static load balancing strategy proposed in Chapter 5.
Chapter 3 Text Classification with Latent Dirichlet Allocation

3.1 Introduction

Automatic text classification is a research hotspot and core technology in the fields of information retrieval and text mining. In recent years, it has received widespread attention and made remarkable progress, and it is one of the key technologies of information retrieval, machine learning and natural language processing [75] [76]. Its aim is to find classification rules from a known training text set, obtain a learner, and have the learner classify unknown new text with good prediction accuracy [136]. This chapter begins with an overview of several key technologies of text classification. Then, a text classification method based on the LDA model is proposed, which uses the LDA topic model as the text representation method: each text is represented as a probability distribution over a fixed latent topic set, and SVM is chosen as the classification algorithm. The proposed method shows good classification performance, and its accuracy is higher than that of the other two methods.

3.2 Overview of Text Classification

Text classification is a process in which a category label is added, automatically or manually, to a given text according to its content. By classifying text, users can not only find needed information more accurately, but also browse information more conveniently [13] [14]. Automatic text classification is a supervised learning process: according to a given training document collection, a relational model between document features and document categories is built by some method, and then the relational model is used to classify given unknown documents [76].
From the view of mathematics, text classification is a mapping process, which maps unclassified text to an existing categorisation [75]. This mapping can be one-to-one or one-to-many, because a document can be associated with multiple categories. The mapping rule of text classification is that the system summarises the classification rules and establishes discrimination equations and discrimination rules according to the information of sample documents in each existing category; when dealing with new documents, the system determines their classes based on the discrimination rules.

Overall, automatic text classification consists of five parts, as shown in Figure 3.1: text preprocessing, text representation model, text feature extraction and dimension reduction, text classification model, and classification performance evaluation [75] [137]. In this chapter, the Vector Space Model (VSM) [23] [138] [139] is chosen as the text representation, with words selected as features, structuring the text collection as a high-dimensional and sparse word-document matrix.

Figure 3.1: The typical process of automatic text classification.
Dimensionality reduction of the word-document matrix is beneficial to the efficiency and performance of the classifier. One of the most typical models for this purpose, Latent Semantic Indexing (LSI), maps data from the original high-dimensional space to a new low-dimensional space by a space transformation [20] [21] [22]. It is a method that explores the latent semantic relations among words according to word co-occurrence information. Its basic principle is to use the SVD technique from matrix theory to decompose the word frequency matrix, keeping only the K largest singular values and thus mapping text vectors and query vectors from the word space to a K-dimensional semantic space. The dimension reduction effect of LSI in text classification is remarkable, but it may filter out features that are very important for rare categories in the whole document collection, affecting the final classification performance [24] [98]. Moreover, the complexity of the algorithm implementation is another problem of the LSI model that cannot be ignored [140].

Therefore, this chapter presents a text classification method based on LDA. LDA is a probabilistic generative model for discrete data sets such as text collections. It is a three-tier Bayesian model with a three-tier structure of words, topics and documents [30] [31]. Each document is expressed as a mixture of topics, and each topic is a multinomial distribution over a fixed vocabulary. The topics are shared by all documents in the collection, and each document samples its own topic proportions from the Dirichlet distribution. SVM is chosen as the classification model because its separating-surface model can effectively overcome the effects of sample distribution, redundant features, over-fitting and other factors [141] [142] [143]. In addition, studies have shown that SVM has advantages in efficiency and stability compared with Naïve Bayes, linear classification, decision trees, KNN and other classification methods [144] [145] [146].
3.2.1 The Content of Text Classification

Text classification is a process that classifies a large number of documents into one or more categories on the basis of a document's content or properties. The content can be media news, technology reports, emails, technology patents, web pages, books or parts of them [7] [13]. Text classification concerns text categories including the topics of the text (such as sports, politics, economy, arts, etc.), the style of the text, or the relationship between the text and other things. If documents are classified manually, people need to read through all the articles, then classify and file them. This requires many experts with rich experience and specialised knowledge to do a great deal of work. Obviously, this process has disadvantages such as a long cycle, high cost and low efficiency, and it is difficult to meet real needs [75] [147]. Thus, this chapter focuses on how text can be classified automatically.

3.2.2 Text Preprocessing

The purpose of text preprocessing is to structure the unstructured text and keep the key words that are useful for distinguishing the categories of text topics [148]. The process includes text representation and the removal of function words and content words that are not useful for representing a category. Generally, natural language text contains a large number of function words that have no specific meaning, such as prepositions, articles, conjunctions, pronouns, etc. Besides, there are some common content words that have very high frequency in almost all texts; they cannot distinguish texts and may even interfere with keywords, reducing the processing efficiency and accuracy of the classification system, so they should be filtered out. As a result, removing the non-feature words can be done through the following two steps [149]:

First, according to a word segmentation system, each word's part of speech is tagged, and then all the function words are deleted, such as articles, prepositions, adverbs,
modal verbs, conjunctions and pronouns, etc. Second, a stop-word list is established, and the content words that have high frequency in all categories are put into that list. The words and phrases in the stop-word list are filtered out while analysing the text. After the above two steps, the text becomes a sequence of feature words, and for each document the frequency of occurrence of each word is counted.

3.2.3 Text Representation

Text representation uses certain feature words to represent a document [150]. In text classification, only the feature words need to be processed in order to handle the unstructured text, which is a conversion step from unstructured text to structured text. Therefore, establishing a text representation is an important technology, relating to how documents are supplied to the classification system. As we know, the computer does not have human intelligence. After reading an article, people can form an approximate understanding of its content based on their own comprehension ability, but the computer cannot really understand the article. Hence, the text itself cannot be directly used for classification; it must be represented as a mathematical model to facilitate machine processing [11].

In practical applications, the resulting text vectors or word frequency matrices are very large, and the processing of text information cannot be completed if the dependence between words is taken into account. So it is generally assumed that words are mutually independent, in order to reduce the complexity of text information processing [151]. Although this assumption is inconsistent with the practical situation, it greatly reduces the complexity of computing for text classification, and classification performance has also improved significantly; therefore, many classification algorithms take this assumption as a prerequisite. This method is called the bag-of-words model. Counting the frequency of occurrence of each word in each document is the basis for modelling algorithms. Counting the frequency
of occurrence of all words in all documents produces the word frequency matrix. In this way, text information based on natural language, which is difficult to describe with numbers, is expressed in the form of a mathematical matrix, so that the text can be handled using matrix operations [152].

This thesis uses the vector space model to represent a text as a vector made up of feature words. VSM has good computability and operability, has been a text representation method with good performance in recent years, and is one of the most classic models at present [138] [139]. It assumes that the category of a document is often only related to the frequency of occurrence of certain specific words or phrases in the document, and has nothing to do with their location or sequence in the document; it ignores the complex relationships among paragraphs, sentences and words in the text structure. A document can thus be expressed as a multi-dimensional vector composed of feature words [23]. For example, the documents and the whole feature set together compose a matrix $A = (a_{ij})_{m \times n}$, in which the value of $a_{ij}$ stands for the weight of the $i$-th feature word ($0 \le i \le m-1$) in the $j$-th document ($0 \le j \le n-1$). After feature selection, each row represents a feature word and each column represents a document. Generally, TF-IDF is used for the weight calculation [103] [104], and after calculating the weights, normalisation is needed. Equation (3.1) or (3.2) gives the final weight calculation:

$$W(t,d) = \frac{tf(t,d) \times \log(N/n_t + 0.01)}{\sqrt{\sum_{t \in d} \left[ tf(t,d) \times \log(N/n_t + 0.01) \right]^2}} \qquad (3.1)$$

$$W(t,d) = \frac{(1 + \log_2 tf(t,d)) \times \log_2(N/n_t + 0.01)}{\sqrt{\sum_{t \in d} \left[ (1 + \log_2 tf(t,d)) \times \log_2(N/n_t + 0.01) \right]^2}} \qquad (3.2)$$

where $W(t,d)$ is the weight of word $t$ in document $d$, and $tf(t,d)$ is the word frequency
of word $t$ in document $d$. $N$ is the total number of training samples, and $n_t$ is the number of training documents in which $t$ appears. The denominator is the normalisation factor.

3.2.4 Text Feature Extraction and Dimension Reduction

For an ordinary text, the word frequency information is extracted as the classification feature [81] [154] [155]. However, the number of words is quite large, so dimensionality reduction is needed [80] [81] [153]. There are two main reasons. First, it increases the speed and improves the efficiency of classification. Second, different words have different values for classification. For instance, common words that appear in every category contribute little to text classification, whereas words that have a major proportion in a certain specific class and a minor proportion in the other classes contribute a great deal. In order to improve classification accuracy, those weakly expressive words should be removed or down-weighted for each class, screening out a feature set for the class so as to highlight the features that are effective for classification. For example, the number of feature words can often reach $10^4$ or more, but the truly key feature words are just a small part of them, so the really needed feature subspace has to be found within the huge feature space.

This thesis uses a wrapper approach to reduce the space [154]. It maps data from the original high-dimensional space to a new low-dimensional space by a space transformation: the value of each dimension in the low-dimensional space is obtained by a linear or nonlinear transformation of the original space. The linear or nonlinear transformation maps redundant or invalid information to relatively weak dimensions, so as to ensure the effect of feature extraction [155].
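As a concrete illustration of the TF-IDF weighting of Section 3.2.3, the following Python sketch computes the weights of Equation (3.1) for one document and then L2-normalises them; the function name and the toy counts are illustrative only.

```python
import math

def tfidf_weights(tf, N, n_t):
    """Per-document weights following Equation (3.1):
    w(t,d) = tf(t,d) * log(N / n_t + 0.01), then normalised so the
    document vector has unit length (the denominator of Eq. (3.1))."""
    raw = {t: tf[t] * math.log(N / n_t[t] + 0.01) for t in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

# "topic" is frequent in this document but rare in the collection,
# so it should receive a larger weight than the common word "model".
w = tfidf_weights({"topic": 3, "model": 1}, N=100, n_t={"topic": 10, "model": 50})
```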
In brief, the basic idea of feature extraction and dimensionality reduction is as follows [80] [81] [155]. Initially, the feature set includes all the words appearing in the documents, and vectorising all training documents constitutes a vector matrix. A probabilistic topic model then transforms the vector matrix into a small matrix, and the space of this new matrix is the new feature space, in which the classifier is trained. Documents to be classified are first transformed into vectors of the original feature space; each vector is then transformed into a vector of the new feature space by the corresponding conversion method, and this vector of the new feature space is the final vector.

Taking the LSI model as an example, the detailed description is as follows. Assume that $D$ is the initial training set of documents, represented as a $t \times d$ word-document matrix in the original feature space, where each column stands for a document vector, and assume that $r$ is the rank of $D$. Through SVD, $D$ can be expressed as $D = U \Sigma V^T$, where $\Sigma$ is an $r \times r$ diagonal matrix whose entries are the singular values of $D$, $U$ is a $t \times r$ matrix composed of the left singular vectors, and $V$ is a $d \times r$ matrix composed of the right singular vectors. Retaining only the $k$ largest singular values and their corresponding singular vectors gives an approximation $D_k = U_k \Sigma_k V_k^T$, which is the optimal rank-$k$ approximation of $D$. At this point $U_k$, a $t \times k$ matrix, maps the column vectors of the $t \times d$ matrix to $k$-dimensional vectors: $\hat{d} = U_k^T d$. The vectors obtained by this conversion can be used for training and classification [20] [22].
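The truncated-SVD projection just described can be sketched in a few lines of NumPy; the function name and the toy matrix are illustrative only.

```python
import numpy as np

def lsi_project(D, k):
    """Truncated SVD: keep the k largest singular values of the
    term-document matrix D (t x d) and project the documents into
    the k-dimensional latent space via U_k^T D."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)  # singular values in descending order
    U_k = U[:, :k]                                    # t x k left singular vectors
    return U_k.T @ D                                  # k x d document vectors

# Toy 3-term x 3-document matrix, projected into 2 latent dimensions.
D = np.array([[1., 0., 1.], [0., 1., 1.], [1., 1., 0.]])
low = lsi_project(D, 2)
```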
3.2.5 Text Classification Algorithms

The core issue of text classification is how a classification function or classification model can be constructed from a large number of labelled training samples according to a certain strategy, and how this model can be used to map text of unknown class to the specified class space. At present, common classification methods based on VSM include the Naïve Bayesian algorithm, K-Nearest Neighbour (KNN), SVM, and so on [75] [76]. There are other methods based on machine learning, such as decision trees and neural network methods. They have been widely applied to information retrieval, text classification, knowledge management, expert systems and other areas. In terms of accuracy, SVM is the best, but it needs a long time and a large computation overhead [145]. KNN is second, but its computational time increases linearly as the training set grows. The neural network classification algorithm is better than the Naïve Bayesian algorithm in recall and precision. The Naïve Bayesian algorithm has a strong theoretical background, and its computing speed is the fastest [146].

Vapnik et al. proposed SVM, a machine learning method based on statistical learning theory, in 1995 [156]. Joachims introduced SVM into text classification, transforming a text classification problem into a series of binary classification problems [141]. Its basic principle is that the vector space is first divided into two disjoint subspaces by constructing a hyperplane, with the feature points on the two sides of the plane belonging to different classes, so that the points of the space are assigned to two different classes [144].

When SVM is used for text classification, a vector space for SVM is constructed first, which means that all training texts are mapped to a vector space. Then, a hyperplane that classifies the training texts correctly or approximately correctly is built in this space; this hyperplane needs to satisfy the classification features of the original
training texts as far as possible. Next, the text to be classified is mapped to this vector space by the same procedure. Finally, estimating the relationship between the mapped vector and the hyperplane determines which class the text belongs to [142] [145].

Figure 3.2 shows the separating hyperplane of the SVM algorithm, where circles and triangles respectively stand for the two classes of text. O1 is the classification line. H1 and H2 are straight lines parallel to the classification line that pass through the samples of each class closest to it. The distance between them is called the classification interval (margin). The optimal classification line not only separates the two classes correctly, but also makes the classification interval the largest. In Figure 3.2, the sample points on H1 and H2 are called the support vectors.

Figure 3.2: The separating hyperplane of the SVM algorithm.

Suppose a given set of training data $T = \{(x_i, y_i)\}$, $i = 1, \ldots, n$, where $x_i \in R^d$ is the
feature vector of the $i$-th sample and $y_i \in \{-1, +1\}$ is the category tag of the $i$-th sample. A real-valued function is sought on $x \in X \subseteq R^d$ so that $f(x) = \mathrm{sgn}(g(x))$. When $g(x)$ is a linear function, $g(x) = (w \cdot x) + b$, and the function $f(x)$ determines the classification rule; this is called a linear classification learning machine. When $g(x)$ is a nonlinear function, it is called a nonlinear classification learning machine [141].

For the linear classification problem, the general form of the linear discriminant function in a $d$-dimensional space is $g(x) = w \cdot x + b$, and the hyperplane equation is $w \cdot x + b = 0$. After normalising the discriminant function, all samples satisfy $|g(x)| \ge 1$. If the hyperplane can classify all samples correctly, it should satisfy the following equation:

$$y_i[(w \cdot x_i) + b] - 1 \ge 0, \quad i = 1, \ldots, n \qquad (3.3)$$

At this point, the classification interval is $2/\|w\|$. The largest interval is obtained when $\|w\|^2$ is minimised, so let $\Phi(w) = \frac{1}{2}\|w\|^2 = \frac{1}{2}(w \cdot w)$. If the hyperplane satisfies Equation (3.3) while minimising $\Phi(w)$, it is called the optimal hyperplane [145]. This is a quadratic optimisation problem, so it can be transformed into its dual form by the Lagrange multiplier method, under the constraint $\sum_{i=1}^{n} \alpha_i y_i = 0$, where the $\alpha_i$ are the Lagrange coefficients and $\alpha_i \ge 0$, $i = 1, \ldots, n$. Maximising the following function yields a unique solution:

$$Q(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \qquad (3.4)$$

If $\alpha^*$ is the optimal solution, then $w^* = \sum_{i=1}^{n} \alpha_i^* y_i x_i$. According to the Kuhn-Tucker conditions
[141], this optimisation problem must satisfy $\alpha_i(y_i(w \cdot x_i + b) - 1) = 0$, where $i = 1, \ldots, n$. From this equation, the $\alpha_i$ corresponding to samples away from the hyperplane must be zero, while non-zero $\alpha_i$ correspond to samples located on the margin hyperplanes, which are called support vectors. The value of $b^*$ can be obtained by substituting any support vectors into the optimal hyperplane equation, giving $b^* = -\frac{1}{2}(w^* \cdot x_r + w^* \cdot x_s)$, where $x_r$ and $x_s$ are any support vectors from the two classes respectively, satisfying $\alpha_r > 0, y_r = 1$ and $\alpha_s > 0, y_s = -1$. Finally, the optimal classification function is obtained as follows:

$$f(x) = \mathrm{sgn}[w^* \cdot x + b^*] = \mathrm{sgn}\left[\sum_{i=1}^{n} \alpha_i^* y_i (x_i \cdot x) + b^*\right] \qquad (3.5)$$

For nonlinear problems [145] [156], the original feature space can be mapped to a high-dimensional space by a kernel function, making the original samples linearly separable in the high-dimensional space. Constructing the optimal classification hyperplane in the high-dimensional space thus transforms nonlinear problems into linear problems. Assume a nonlinear mapping $\Phi: R^d \to H$ by which the input space samples $R^d$ are mapped to a high-dimensional feature space $H$. When the optimal classification hyperplane is constructed in the feature space, the algorithm only uses dot products, namely $\Phi(x_i) \cdot \Phi(x_j)$, rather than the individual $\Phi(x_i)$. Hence, a function $K$ needs to be found that satisfies $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$. In fact, the inner product operation in the high-dimensional space can then be achieved by functions of the original space, so using an appropriate inner product $K(x_i, x_j)$ in the optimal classification hyperplane accomplishes linear classification after a nonlinear transformation. At this point, Equation (3.4) becomes Equation (3.6):

$$Q(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \qquad (3.6)$$
And the corresponding classification function becomes Equation (3.7):

$$f(x) = \mathrm{sgn}\left[\sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b''\right] \qquad (3.7)$$

In Equation (3.7), $b''$ is the classification threshold. If $f(x) \ge 0$, $x$ belongs to one class; otherwise, it belongs to the other class. SVM has many unique advantages when it deals with small-sample, nonlinear and high-dimensional pattern recognition problems [141] [143]. For example, the SVM algorithm is not limited by the theory of samples tending to infinity; it is also suitable for text classification on a large sample set, and for function fitting and other machine learning issues. Therefore, this chapter mainly introduces and uses SVM as the text classification algorithm. However, research has shown that when SVM is used alone, its training speed is greatly influenced by the size of the training set because of the high computational overhead [142].

3.2.6 Classification Performance Evaluation System

In general, text classification results are evaluated from three aspects: validity, computational complexity and simplicity of description [157]. Validity measures the ability of a classifier to classify correctly. Computational complexity includes time complexity and space complexity, i.e., the classification speed and the amount of hardware resources employed. Simplicity of description means that algorithms should be described as simply as possible. Among these three aspects, validity is the most important factor in determining whether a text classification system is qualified, and it is also the basis for the other evaluation indicators [158]. The specifics are given in Section 3.4.4, Evaluation methods.
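As a concrete illustration of the kernel decision function in Equation (3.7), the following Python sketch evaluates $f(x)$ for a hand-picked toy model with an RBF kernel. The support vectors, multipliers and kernel parameter are illustrative assumptions, not the output of an actual training run.

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2), one common choice of kernel function."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def svm_decision(x, support_vectors, alphas, labels, b, kernel=rbf_kernel):
    """Equation (3.7): f(x) = sgn( sum_i alpha_i * y_i * K(x_i, x) + b ).
    Only the support vectors (alpha_i > 0) contribute to the sum."""
    s = sum(a * y * kernel(sv, x)
            for a, y, sv in zip(alphas, labels, support_vectors))
    return 1 if s + b >= 0 else -1

# Toy model: one support vector per class with hand-picked multipliers.
svs, alphas, labels, b = [(0.0, 0.0), (2.0, 2.0)], [1.0, 1.0], [-1, 1], 0.0
pred = svm_decision((1.8, 1.9), svs, alphas, labels, b)
```

A point near the positive support vector (2, 2) receives a larger kernel value from it than from the negative one, so the sign of the weighted sum assigns it to the positive class.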
3.3 Text Classification based on LDA

3.3.1 Gibbs Sampling: Approximate Inference of the LDA Model Parameters

In order to obtain the probabilistic topic distribution of a text, this thesis does not directly calculate the word distribution over topics $\phi$ and the topic distribution over documents $\theta$. Instead, according to the visible word sequence in the documents, the posterior probability $p(w \mid z)$, namely the probability of word $w$ given topic $z$, is obtained; the Gibbs sampling algorithm then indirectly yields the values of $\phi$ and $\theta$ [30] [108] [111].

Markov Chain Monte Carlo (MCMC) [107] [108] [159] is an approximate iterative method that can extract sample values from a complex probability distribution. It lets a Markov chain converge to the target distribution and then samples from the chain. Each state of the Markov chain is an assignment of the sampling variables, and transitions between states follow simple rules. Gibbs sampling [108] [111] is a simple realisation of MCMC; its purpose is to construct a Markov chain that converges to the target probability distribution, and to extract samples considered close to that distribution from the chain. Thus, determining the probability distribution function of the target is the key to Gibbs sampling.

For each word $w_i$, the corresponding topic variable $z_i$ is assigned an integer $t$ in $[1, 2, \ldots, T]$, which means that this word corresponds to the $t$-th topic. For each word token in the text set, $w_i$ and $d_i$ respectively express its word index and document index. Gibbs sampling processes each word token one by one: under the conditions of the
known topic assignments of all other word tokens, the probability of the current word token belonging to each topic is estimated. Based on this conditional distribution, namely the posterior probability, a topic is re-selected as the current topic of the word token. The conditional distribution can be represented as $p(z_i = j \mid z_{-i}, w)$, where $z_i = j$ means that $w_i$ is assigned to topic $j$; $w_i$ not only represents a word but also relates to its position in the text, and $z_{-i}$ denotes the topics of all word tokens except the current one, i.e., the assignments of all $z_k$ ($k \ne i$). For the LDA model of texts, only the topics of the words need to be assigned, which means that the variable $z$ is sampled. Its equation is the following [38]:

$$p(z_i = j \mid z_{-i}, w) \propto \frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta} \cdot \frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha} \qquad (3.8)$$

where $T$ is the number of topics and $W$ is the number of words in the unique vocabulary. $n^{(w_i)}_{-i,j}$ is the number of times $w_i$ is assigned to topic $j$; $n^{(d_i)}_{-i,j}$ is the number of words of document $d_i$ that are assigned to topic $j$; $n^{(\cdot)}_{-i,j}$ is the number of all
words that are assigned to topic $j$; and $n^{(d_i)}_{-i,\cdot}$ is the number of all the words in document $d_i$ that are assigned to topics. This is a non-standard (unnormalised) probability distribution. The complete Equation (3.9) divides Equation (3.8) by the sum of the corresponding probabilities over all topics [82]:

$$p(z_i = j \mid z_{-i}, w) = \frac{\dfrac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta} \cdot \dfrac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha}}{\sum\limits_{j'=1}^{T} \dfrac{n^{(w_i)}_{-i,j'} + \beta}{n^{(\cdot)}_{-i,j'} + W\beta} \cdot \dfrac{n^{(d_i)}_{-i,j'} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha}} \qquad (3.9)$$

where the symbols are as in Equation (3.8), and none of the counts includes the current assignment $z_i = j$.

To sum up, the basic idea of the Gibbs sampling algorithm is to use the obtained distribution of topics over words and the obtained distribution of documents over topics to infer the topic $z$ of the currently sampled word $w$; it is an iterative process that yields all the latent variables [108]. From the above process, it can be seen that assigning a word to a topic is influenced by two probability distributions: the distribution of topics over words and the distribution of documents over topics. Equation (3.9) also shows that, for a given word token, the factors affecting the topic label include two parts: the left part corresponds to the probability of word $w_i$ under topic $j$, and the right part corresponds to the probability of topic $j$ in the topic distribution of document $d_i$. Once many tokens of a word are labelled as topic $j$, the probability of any token of this word being labelled as topic $j$ increases.

Therefore, the Gibbs sampling algorithm for topic modelling of a corpus with the LDA model proceeds as follows [82] [107] [111]:

First, each $z_i$ is initialised to a random integer between 1 and $T$, namely $z_i \in [1, T]$, $i \in [1, N]$, where $T$ is the number of topics and $N$ is the number of all feature word tokens appearing in the texts of the corpus; this is the initial state of the Markov chain.

Second, in each round of Gibbs sampling, all of the $N$ word tokens in the text set are re-assigned a new topic in turn, which means that $i$ loops from 1 to $N$
According to Equation (3.9), words are assigned to topics so that the next state of the Markov chain is calculated and obtained.

Thirdly, in the early stage of the sampling process, the result of Gibbs sampling is not accurate enough because the simulation of the posterior probability is inadequate. After enough iterations of the second step, the result of Gibbs sampling gradually approaches the target distribution, finally reaching a steady state that is very close to it. At this moment the Markov chain has approached the target distribution, and the current values of $z$ are taken and recorded as a sample.

Finally, in order to keep the autocorrelation small, further samples are recorded only after a number of additional iterations. The token index is now dropped and $w$ represents a unique word. For every single sample, $\varphi$ and $\theta$ can be estimated by the following equations:

$$\hat{\varphi}_j^{(w)} = \frac{n_j^{(w)} + \beta}{n_j^{(*)} + W\beta} \qquad (3.10)$$

$$\hat{\theta}_j^{(d)} = \frac{n_j^{(d)} + \alpha}{n_*^{(d)} + T\alpha} \qquad (3.11)$$

where $n_j^{(w)}$ is the frequency of word $w$ being assigned to topic $j$; $n_j^{(*)}$ is the number of all words assigned to topic $j$; $n_j^{(d)}$ is the number of words in document $d$ assigned to topic $j$; and $n_*^{(d)}$ is the number of all words in document $d$ assigned to topics.

3.3.2 The Specific Steps of Text Classification

Research has shown that the training convergence rate of SVM on a large data set is slow, and that it needs a lot of storage resource and very high computing power. However, its maximum-interval classification model can effectively overcome the effects of sample
distribution, redundant features, over-fitting and other factors. Besides, it has good generalization ability. Compared with other classification algorithms, SVM has the advantages of effectiveness and stability. [142] [145] have proved that its performance in text classification is better than that of other algorithms. This chapter uses the feature-word probability distributions of the LDA model and then trains an SVM classifier on the latent topic-document matrix. In order to improve the performance of text classification, the good dimension-reduction and text-representation ability of the LDA model and the strong classification capacity of SVM are combined.

The text classification method based on the LDA model uses LDA to model the text set. Gibbs sampling is used to infer and indirectly calculate the model parameters, obtaining the probability distribution of documents over topics [108]. After that, an SVM is trained on the latent topic-document matrix. Finally, a text classification system is constructed. The specific steps are as follows:

Step 1: For the original text set, extract stems and remove stop words to form a collection of feature words, or choose a suitable, already processed text set.

Step 2: The optimal-number-of-topics selection method is used to determine the optimal number of topics $T$ of the LDA model, so that the model fits the effective information in the corpus data best. Details of the method are given in the next chapter.

Step 3: LDA is used to model the training text set. Model parameters are inferred by the Gibbs sampling algorithm. After enough iterations, each text is represented as a probability distribution over the fixed latent topic set. The latent topic-document matrix $A_{t \times d}$ of the training text set is obtained, where $t$ is the number of topics of the latent topic set and $d$ is the number of documents.

Step 4: An SVM is trained on the above matrix $A_{t \times d}$, which gives an SVM classification model based on LDA. A text classifier is thus constructed.
Step 5: The documents to be classified, after preprocessing, take the place of document $d$ in Equation (3.8) and are processed by the Gibbs sampling algorithm. After a few iterations, the values of $\varphi$ and $\theta$ can be calculated according to Equations (3.10) and (3.11). In this way, the probability distribution vector over the latent topic set of each document to be classified is obtained.

Step 6: The constructed SVM classification model is brought in to predict the categories of the documents to be classified.

The documents to be classified are new documents that were not processed when the corpus was trained. If every unknown text were added to the corpus, the whole model would have to be retrained. Obviously, this wastes too much time, and it is also unnecessary. In this thesis, the documents to be classified are, after preprocessing, only processed by the Gibbs sampling algorithm, which reduces the number of iterations.

3.4 Experiment and Analysis

3.4.1 Experimental Environment

The experiment was done on a personal laptop with an Intel(R) Core(TM) 2 Duo T8100 2.10 GHz CPU (running at 2.09 GHz) and 3.00 GB of memory. The capacity of the hard disk is 160 GB. The operating system is Microsoft Windows XP Professional Version 2002 SP3. The development environment is Matlab 7.0.

3.4.2 Training Environment for SVM

LIBSVM [160] [161] is a simple, fast and effective general SVM software package, which was developed by Dr Chih-Jen Lin et al. It can solve classification problems
(including C-SVC and nu-SVC), regression problems (including epsilon-SVR and nu-SVR), distribution estimation (such as one-class SVM) and other issues, providing four common kernel functions: the linear function, the polynomial function, the radial basis function (RBF) and the sigmoid function. It can effectively solve multi-class problems, choose parameters by cross-validation, weight unbalanced samples, estimate probabilities for multi-class problems and so on. LIBSVM is an open source software package [161]. It not only provides the C++ source code of LIBSVM, but also supplies interfaces for Python, Java, R, MATLAB, Perl, Ruby, LabVIEW, C#.NET and other languages. It can be used conveniently on Windows or UNIX platforms [160]. Therefore, this experiment uses LIBSVM 3.0 as the training environment for SVM, together with Python 2.7.1 and the graphics software gnuplot 4.4.3.

In brief, the general steps of using LIBSVM are [160] [161]:

Step 1: Prepare the data set according to the format required by the LIBSVM software package.

Step 2: Perform a simple scaling operation on the above data.

Step 3: Consider using the RBF kernel function $K(x, y) = e^{-\gamma \|x - y\|^2}$.

Step 4: Use cross-validation to select the optimal parameters $C$ and $\gamma$.

Step 5: Use the above optimal parameters $C$ and $\gamma$ to train on the whole training set, obtaining the SVM model.

Step 6: Use the obtained model to test and forecast.

3.4.3 The Data Set

The quality of the training text set can greatly influence the performance of a machine learning classifier [24]. Thus, the training text set must be able to accurately represent the various
documents which need to be processed by the classification system, while the training documents should accurately reflect the complete text statistical information of each class. In general, a recognized and classified corpus should be used in studies of text classification so that the performance of different classification methods and systems can be compared accurately [14] [162]. Currently, the commonly used text sets include WebKB, Reuters-21578, Reuters Corpus Volume 1 (RCV1), 20Newsgroups, etc. [163]. The WebKB data set was created by the World Wide Knowledge Base project of the Carnegie Mellon University text learning group and consists of web data from the websites of several university computing departments. The Reuters-21578 text set contains texts which Reuters has been collecting since 1987; text indexing and classification were done manually by Reuters. The RCV1 data set is also from Reuters, and is mainly for multi-label text classification. The 20Newsgroups data set contains about 20,000 documents from 20 newsgroups [163] [164]. It is a balanced corpus: each class has approximately 1,000 documents. As shown in Table 3.1, this experiment chooses the comp subset, which has five classes; likewise, each class has almost 1,000 documents. In addition, the training set and test set were divided according to the ratio of 3:2.

Table 3.1: The distribution of the five classes of text in the 20Newsgroups corpus.

  Class number   Class                       Number of texts   Distribution
  1              comp.graphics               975               19.94%
  2              comp.os.ms-windows.misc     983               20.10%
  3              comp.sys.ibm.pc.hardware    985               20.14%
  4              comp.sys.mac.hardware       961               19.65%
  5              comp.windows.x              986               20.16%
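The per-class 3:2 train/test division described above can be sketched as follows (a Python illustration, not the thesis's Matlab code; the corpus contents below are placeholders, only the class names come from Table 3.1):

```python
import random

def stratified_split(docs_by_class, train_ratio=0.6, seed=42):
    """Split each class separately at the given ratio (3:2 when 0.6),
    so the class distribution is preserved in both subsets."""
    rng = random.Random(seed)
    train, test = [], []
    for label, docs in docs_by_class.items():
        docs = list(docs)
        rng.shuffle(docs)
        cut = int(round(len(docs) * train_ratio))
        train += [(label, d) for d in docs[:cut]]
        test += [(label, d) for d in docs[cut:]]
    return train, test

# Toy corpus with the five comp.* classes (ten placeholder documents each)
classes = ["comp.graphics", "comp.os.ms-windows.misc",
           "comp.sys.ibm.pc.hardware", "comp.sys.mac.hardware",
           "comp.windows.x"]
corpus = {c: ["%s_doc%d" % (c, i) for i in range(10)] for c in classes}
train_set, test_set = stratified_split(corpus)
```

Splitting class by class, rather than shuffling the pooled corpus, guarantees that every class keeps the 3:2 ratio exactly, which matters for the per-class precision and recall reported later.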
3.4.4 Evaluation Methods

In order to evaluate the validity of the automatic text classification system, the traditional evaluation criteria are used first, which comprise precision $P$, recall $R$ and the F1 measure [157] [158]. Their equations are as follows:

$$P_i = \frac{t_i}{m_i} \qquad (3.12)$$

where $t_i$ is the number of correctly classified texts in the $i$-th class and $m_i$ is the number of texts which are classified by the classifier as belonging to the $i$-th class. Equation (3.12) gives the precision of the $i$-th class.

$$R_i = \frac{t_i}{n_i} \qquad (3.13)$$

where $t_i$ is the number of correctly classified texts in the $i$-th class and $n_i$ is the number of texts belonging to the $i$-th class in the original text set. Equation (3.13) gives the recall of the $i$-th class.

$$F1_i = \frac{2\, P_i\, R_i}{P_i + R_i} \qquad (3.14)$$

where $P_i$ is the precision and $R_i$ the recall of the $i$-th class. Equation (3.14) expresses the comprehensive classification rate of the $i$-th class, and it is a usual performance evaluation measure which combines precision and recall.

However, precision, recall and the F1 measure all describe the performance on a single class. When the whole performance of a classification method is evaluated, the results of all classes need to be considered and combined. In general, there are two kinds of comprehensive methods: macro-average and micro-average [157]. Compared with the micro-average indexes, the macro-average indexes are affected more strongly by small classes. They respectively include the macro-average precision $Macro\_P$, macro-average
recall $Macro\_R$, macro-average F1 measure $Macro\_F1$, micro-average precision $Micro\_P$, micro-average recall $Micro\_R$ and micro-average F1 measure $Micro\_F1$. The macro-average is the arithmetic mean of the performance index over classes, and the micro-average is the arithmetic mean over documents [157]. For a single document, precision and recall are the same: a document is given exactly one predicted class, which is either correct (1) or not (0). Therefore, the micro-average precision and recall are the same, and according to Equation (3.14), for the same data set, the micro-average precision, micro-average recall and micro-average F1 measure are all equal. Their equations are as follows:

$$Macro\_P = \frac{1}{k} \sum_{i=1}^{k} P_i \qquad (3.15)$$

where $P_i$ is the precision of the $i$-th class and $k$ is the total number of classes of the original text set.

$$Macro\_R = \frac{1}{k} \sum_{i=1}^{k} R_i \qquad (3.16)$$

where $R_i$ is the recall of the $i$-th class and $k$ is the total number of classes of the original text set.

$$Macro\_F1 = \frac{2 \cdot Macro\_P \cdot Macro\_R}{Macro\_P + Macro\_R} \qquad (3.17)$$

where $Macro\_P$ is the macro-average precision and $Macro\_R$ is the macro-average recall.

$$Micro\_F1 = Micro\_R = Micro\_P = \frac{\sum_{i=1}^{k} t_i}{\sum_{i=1}^{k} m_i} \qquad (3.18)$$

where $t_i$ is the number of correctly classified texts in the $i$-th class, $m_i$ is the
number of texts which are classified by the classifier as belonging to the $i$-th class, and $k$ is the total number of classes of the original text set.

3.4.5 Experimental Results

First of all, the original texts of the experimental data set are preprocessed, which includes extracting word stems (stemming) and removing stop words [148]. For example, the stemming function of Snowball (porter2) can be used to extract word stems, forming feature words such as "success" instead of "successful". Stemming makes inflected forms revert to the original word. Stop words such as "a", "an", "the" and "of" contribute little to the classification accuracy and should be discarded. In this way, the dimension of the feature space can be reduced effectively, the processing speed improved and the overheads of the algorithm decreased [149].

Then, the whole training set is modelled by LDA, setting $\alpha = 50/T$ and $\beta = 0.01$ according to experience [165]. The number of topics is set to 120, namely $T = 120$; the reason and details will be shown in Section 4.2 in the next chapter. After 1000 iterations of the Gibbs sampling algorithm, the latent topic-document matrix $A_{t \times d}$ is obtained. At this moment, each document is represented by its probability mixture distribution $p(z_i)$ over the topic set of 120 topics. Finally, a classifier based on SVM is built on the matrix $A_{t \times d}$, realized with LIBSVM and the Matlab platform.

As contrast experiment 1, VSM is used for text representation, and SVM is used for classifying after feature extraction. As contrast experiment 2, the word-document matrix is decomposed by the SVD of LSI, building a latent semantic matrix, realizing the dimension reduction of features, and combining with the SVM algorithm. This process is carried out in Matlab: for example, the function svds is invoked to decompose the
word-document matrix, generating a latent semantic space. Commonly, the value of $k$ is between 100 and 300 [166]. If the value of $k$ is too small, useful information may be lost; but if it is too large, the storage space limits it. For the training text set in this experiment, different values of $k$ were tested, and finally it was set to 150.

The results of the text classification method based on the LDA model proposed in this chapter and of contrast experiments 1 and 2 are shown in Figure 3.3 and Table 3.2.

Figure 3.3: Comparison of the performance of the three methods on each class.

If the differences among the classes in the data set are large, the difference between macro-average and micro-average will be relatively large. In Table 3.2, their difference is small, which indicates that the differences among the classes in the training set are small. This is consistent with the actual situation: since the five classes in this
experiment belong to the same comp subset, the differences between them are small.

Table 3.2: Comparison of the macro-averages and micro-averages of the three methods.

  Methods    Macro_P    Macro_R    Macro_F1   Micro_F1
  VSM+SVM    0.804153   0.805347   0.805023   0.80574
  LSI+SVM    0.825966   0.822337   0.82398    0.828164
  LDA+SVM    0.865325   0.86968    0.864919   0.868738

After preprocessing, the number of features in the feature set of the original corpus is 36487. In contrast experiment 2, the dimension of the semantic space is reduced greatly by the SVD of LSI. Compared with contrast experiment 1, contrast experiment 2 uses a smaller number of features and still guarantees the classification effect. The experimental results show that a text classification method based on LDA, which uses Gibbs sampling to infer the model parameters and chooses SVM as the classification algorithm, obtains a better classification effect. In Table 3.2, all of its evaluation measures reach 86%, higher than those of the other two methods.

Table 3.3: Comparison of the degrees of dimensionality reduction of the corpus by the three methods.

                                             VSM       LSI       LDA
  Number of features after
  dimensionality reduction                   1600      150       120
  Dimensionality reduction degree            95.615%   99.589%   99.671%

Furthermore, Table 3.3 shows that an LDA model with 120 topics modelling the text set can decrease the feature space by up to 99.671% and reduce the training time of SVM, while the results show that the classification performance is not affected. It
not only overcomes the damage to classification performance which can be caused by feature dimension reduction, but also avoids the issue of ignoring the semantic relations among words while extracting features. Compared with the two contrast tests, the method of constructing a text classifier based on LDA and SVM can improve the performance and efficiency of text classification effectively.

3.5 Summary

At the beginning, this chapter summarized the key technologies of text classification, including text preprocessing, text representation models, text feature extraction and dimension reduction, text classification models, and classification performance evaluation. Then, a method of text classification was proposed. It makes use of the LDA probabilistic topic model for text representation, combined with the SVM classification algorithm, to construct a text classifier. Each document is represented as a probability distribution over the fixed latent topic set. A detailed description of how to use the Gibbs sampling algorithm to estimate the model parameters was given. Besides, the steps of the proposed classification method were described. Finally, the validity and advantages of the text classifier based on LDA and SVM were validated by contrast experiments on three methods: VSM+SVM, LSI+SVM and LDA+SVM.
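The Gibbs update of Equation (3.9) and the estimates of Equations (3.10)-(3.11) summarized above can be sketched as follows (a minimal Python illustration, not the thesis's Matlab implementation; corpus, hyper-parameters and iteration counts here are toy values):

```python
import random

def gibbs_lda(docs, W, T, alpha, beta, iters, seed=0):
    """Collapsed Gibbs sampling for LDA. docs is a list of documents,
    each a list of word ids in [0, W). Returns phi (Eq. 3.10) and
    theta (Eq. 3.11) estimated from the final sample."""
    rng = random.Random(seed)
    n_wt = [[0] * T for _ in range(W)]          # n_j^(w): word-topic counts
    n_dt = [[0] * T for _ in range(len(docs))]  # n_j^(d): doc-topic counts
    n_t = [0] * T                               # n_j^(*): topic totals
    z = []                                      # topic assignment per token
    for d, doc in enumerate(docs):              # random initial Markov state
        zd = []
        for w in doc:
            j = rng.randrange(T)
            zd.append(j); n_wt[w][j] += 1; n_dt[d][j] += 1; n_t[j] += 1
        z.append(zd)
    for _ in range(iters):                      # each round resamples all N tokens
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                j = z[d][i]                     # remove current token: the "-i" counts
                n_wt[w][j] -= 1; n_dt[d][j] -= 1; n_t[j] -= 1
                # Eq. (3.9); the document-length denominator is constant
                # over k, so it cancels under normalization.
                p = [(n_wt[w][k] + beta) / (n_t[k] + W * beta) *
                     (n_dt[d][k] + alpha) for k in range(T)]
                r = rng.random() * sum(p)
                j, acc = T - 1, 0.0             # fallback guards float round-off
                for k in range(T):
                    acc += p[k]
                    if r <= acc:
                        j = k
                        break
                z[d][i] = j
                n_wt[w][j] += 1; n_dt[d][j] += 1; n_t[j] += 1
    phi = [[(n_wt[w][j] + beta) / (n_t[j] + W * beta) for w in range(W)]
           for j in range(T)]                   # Eq. (3.10)
    theta = [[(n_dt[d][j] + alpha) / (len(docs[d]) + T * alpha)
              for j in range(T)] for d in range(len(docs))]  # Eq. (3.11)
    return phi, theta
```

Each row of phi and theta is a proper distribution by construction, since the smoothing terms in the denominators match the sums of the numerators.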
Chapter 4

Chapter 4 Accuracy Enhancement with Optimized Topics and Noise Processing

4.1 Introduction

In the LDA model, topics obey the Dirichlet distribution. This distribution assumes that a topic has nothing to do with the other topics. In fact, however, there are relationships among topics. The contradiction between this independence assumption and the real data makes changes of the number of topics affect the LDA model greatly. Therefore, when the corpus is modelled by LDA, the number of topics $T$ has a great effect on how well the LDA model fits the text set. In addition, some research has shown that noisy data in the training text set can influence the result of text classification badly. So, in addition to an efficient and feasible classification algorithm, it is necessary to process noisy data in the training text set before constructing the classifier and performing classification. In this way, the quality of the classifier and the accuracy of the classification result can be ensured.

4.2 The Method of Selecting the Optimal Number of Topics

In the last chapter, the corpus was modelled by LDA and combined with the SVM algorithm so as to build a classifier. The number of topics was 120 in the experiment. Thus, each document was expressed as a 120-dimensional multinomial vector over the topic set, and the feature space was reduced by 99.671%. But how was the optimal number of topics, 120, determined? The method of selecting the optimal number of topics is introduced below.
4.2.1 Current Main Methods for Selecting the Optimal Number of Topics Based on the LDA Model

Topics in the LDA model can capture the relationships among words, but LDA cannot express the relationships among topics, because sampling based on the Dirichlet distribution assumes that topics are independent of each other. This limits the capability of LDA to represent large-scale data sets and to predict new data. Therefore, many scholars have begun to research richer structures to describe the relationships among topics. Blei et al. presented the Correlated Topic Model (CTM) [85]. The key of CTM is that it uses the logistic-normal distribution instead of the Dirichlet distribution to describe the latent topics of document collections. The logistic-normal distribution has two sets of parameters: a mean vector and a covariance matrix. The role of the mean vector is similar to that of the Dirichlet parameters in LDA; it expresses the relative strength of the latent topics. The covariance matrix represents the correlation between each pair of latent topics. This structural information is not present in the LDA model. However, CTM also has a limitation: it can only describe the correlation between pairs of topics. Thus, the Pachinko Allocation Model (PAM) was put forward [167]. Its core idea is to use a Directed Acyclic Graph (DAG) to represent the structure among latent topics. In the DAG of PAM, each leaf node stands for a word, every intermediate node represents a topic, and each topic is a multinomial distribution over its child nodes. PAM extends the meaning of topics: a topic can be not only a multinomial distribution over the word space, but also a multinomial distribution over other topics, which are called super-topics. Therefore, PAM can flexibly describe not only the relationships among words but also the relationships among topics [167]. However, as in other topic models, the number of topics $T$ has to be
determined manually, and the setting of the number of topics $T$ directly affects the extracted topic structure.

4.2.1.1 The Method of Selecting the Optimal Number of Topics Based on HDP

The Hierarchical Dirichlet Process (HDP), a kind of generative model, was proposed by Y. Teh [88] as a hierarchical modelling approach in 2005. In addition, Y. Teh [86] put forward using HDP to find the optimal number of topics $T$ in LDA. HDP is a distribution over sets of random probability measures, which can model grouped data with a predefined multilayer structure. The HDP model is shown in Figure 4.1; the rectangular boxes are plates that represent replicates. The hollow circles represent the latent variables $G_0$, $G_j$ and $\theta_{ji}$, and the filled circle represents the observed variable $x_{ji}$. In addition, the squares represent the parameters, i.e. the base distribution $H$, $\gamma$ and $\alpha_0$, and each black arrow represents the conditional dependency between two variables, namely the conditional probability. HDP contains a global probability measure $G_0$. Each predefined group is expressed by a Dirichlet Process (DP). The probability measure $G_j$ corresponding to this DP can be obtained by sampling from the upper-level DP, as $G_j \mid \alpha_0, G_0 \sim DP(\alpha_0, G_0)$. The atoms of $G_0$ are shared by all the DPs, but each DP has a different mixing ratio over $G_0$. The hyper-parameters of HDP include the base distribution $H$ and the concentration parameters $\gamma$ and $\alpha_0$. $H$ provides the prior distribution for $\theta_{ji}$. $G_0$ obeys the DP with $H$ and $\gamma$, as $G_0 \mid \gamma, H \sim DP(\gamma, H)$. If an HDP model is used as the prior distribution of grouped data about $\theta_{ji}$, then for any group $j$ it is assumed that $\theta_{j1}, \theta_{j2}, \ldots$ are independent identically distributed random variables drawn from $G_j$, as $\theta_{ji} \mid G_j \sim G_j$. Each $\theta_{ji}$ distribution can be used to generate the corresponding
observed variable $x_{ji}$, as $x_{ji} \mid \theta_{ji} \sim F(\theta_{ji})$.

Figure 4.1: The HDP model.

The HDP model can be seen as a nonparametric version of LDA. In the LDA model, the weights of the $T$ topics in a text are first determined according to the Dirichlet distribution. Next, one of the $T$ topics is chosen, and the specific words are selected according to the probabilities of all words in the topic. In the HDP model, by contrast, the number of topics can grow. Y. Teh considered that the structure of HDP is similar to the structure of LDA, so the nonparametric feature of HDP was used to solve the problem of selecting the optimal number of topics in LDA [86]. The basic idea is to use a near-infinite probability measure $G_0$ to replace the finite mixture of topics in LDA. A new DP with its measure $G_j$ is built for each document, each with its own mixing ratio. All documents share the mixing elements of $G_0$. The experiments found, by analyzing the histogram of the sampled number
of mixing elements in the HDP model, that the optimal number of mixing elements was consistent with the optimal number of topics in LDA [86]. Therefore, HDP can solve the problem of selecting the optimal number of topics $T$ in LDA. However, this method has to establish both an HDP model and an LDA model for the same data set. Obviously, when a large set of documents or a real text set is processed, the method based on HDP spends a lot of time in computing [87] [168].

4.2.1.2 The Standard Method of Bayesian Statistics

In the LDA topic model, the number of topics needs to be given manually, and the number of topics directly affects the quality with which LDA models the document collection. The classic method is to use the standard method of Bayesian statistics [82] to select the optimal number of topics $T$. In the LDA model, $\alpha$ and $\beta$ are respectively the Dirichlet prior probability hypotheses on the multinomial distributions $\theta$ and $\varphi$. The natural-conjugate property means that the value of the joint probability $p(w, z)$ can be obtained by integrating out $\theta$ and $\varphi$. Because $p(w, z) = p(w \mid z)\, p(z)$, where $\varphi$ and $\theta$ appear in the first and second term of the right side respectively, the value of the first term $p(w \mid z)$ can be obtained by integrating out $\varphi$. The corresponding equation is as follows:

$$p(w \mid z) = \left( \frac{\Gamma(W\beta)}{\Gamma(\beta)^{W}} \right)^{T} \prod_{j=1}^{T} \frac{\prod_{w} \Gamma(n_j^{(w)} + \beta)}{\Gamma(n_j^{(*)} + W\beta)} \qquad (4.1)$$

where $\Gamma(\cdot)$ is the standard gamma function, $n_j^{(w)}$ stands for the frequency of word $w$ being assigned to topic $j$, and $n_j^{(*)}$ represents the number of all words assigned to topic $j$. $p(w \mid T)$ can be approximated from the values of $p(w \mid z)$, where $M$ is defined as the number of samples taken in the Gibbs sampling algorithm. The corresponding equation is as follows:
$$\frac{1}{p(w \mid T)} = \frac{1}{M} \sum_{m=1}^{M} \frac{1}{p(w \mid z^{(m)})} \qquad (4.2)$$

that is, $p(w \mid T)$ is the harmonic mean of $p(w \mid z^{(m)})$ over the $M$ recorded Gibbs samples. The probability of the text set under each candidate number of latent topics can be obtained by Equation (4.2), which gives the value of $p(w \mid T)$. The value of $p(w \mid T)$ is proportional to the degree to which the LDA model fits the effective information in the text set, and is used to determine the optimal number of latent topics $T$ [82].

4.2.2 A Density-based Clustering Method for Selecting the Optimal Number of Topics in LDA

Clustering gathers a large number of data samples into classes, making the similarity among samples in the same class maximal and the similarity between different classes minimal. From this perspective, a density function can be designed to calculate the density around each sample. According to the density value around each sample, the areas where the samples are relatively concentrated can be found; these areas are the targeted classes. This clustering method is called density-based clustering [169] [170]. Its basic idea is to define classes as high-density areas separated by low-density areas in the data space. The adjacent text areas with a high distribution density in the data space are connected so as to recognize irregularly shaped classes, remove some noisy data and process abnormal data effectively. In other words, as long as the density of points in an area is greater than a threshold value, the points are added to the cluster closest to them. It has been proved that density-based clustering is a very effective clustering method. The representative algorithm is the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [171] [172] [173] [174].
4.2.2.1 The DBSCAN Algorithm

First of all, DBSCAN defines two parameters: the cluster radius $\varepsilon$ and the value of MinPts. The former describes the cluster density, represented by the mean distance within clusters; the latter is the minimum number of texts within the scope of a single cluster. Several basic concepts of the DBSCAN algorithm are introduced as follows [171] [172] [175]:

Density: the density of any point in space is the number of points in the circular area whose centre is the point and whose radius is $\varepsilon$.

$\varepsilon$-Neighborhood: the $\varepsilon$-neighborhood of any point in space is the set of points in the circular area whose centre is the point and whose radius is $\varepsilon$.

Core Object: if the density of a point is greater than a given threshold MinPts, the point is called a core object.

Border Object: if the density of a point is less than the given threshold MinPts, the point is called a border object.

Directly Density Reachable: a data set $D$, $\varepsilon$ and MinPts are given, with $p \in D$, $q \in D$. If $q$ is a core object and $p$ is in the $\varepsilon$-neighborhood of $q$, then $p$ is directly density reachable from $q$.

Density Reachable: a data set $D$, $\varepsilon$ and MinPts are given. If there is a chain $p_i \in D$ ($1 \le i \le n$) with $p_1 = q$, $p_n = p$, and each $p_{i+1}$ directly density reachable from $p_i$, then $p$ is density reachable from $q$.

Density Connected: a data set $D$, $\varepsilon$ and MinPts are given, with $p \in D$, $q \in D$. If there is an object $r$ such that both $p$ and $q$ are density reachable from $r$, then $p$ and $q$ are density connected.

Cluster: a cluster is a maximal collection of density-connected points based on direct density reachability.

Noise: the noise is the set of objects which are not included in any cluster.
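The definitions above translate into the standard DBSCAN procedure; a minimal Python sketch over points in the plane (the Euclidean distance and the toy data are assumptions for illustration; the thesis applies the idea to topic vectors):

```python
import math

def dbscan(points, eps, min_pts):
    """Classic DBSCAN: returns one cluster label per point, -1 = noise.
    A point is a core object when its eps-neighborhood (including
    itself) holds at least min_pts points."""
    n = len(points)
    labels = [None] * n              # None = not yet visited

    def neighbors(i):
        return [j for j in range(n)
                if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:     # not a core object: tentatively noise
            labels[i] = -1
            continue
        labels[i] = cluster          # a core object starts a unique cluster
        queue = [j for j in seeds if j != i]
        while queue:                 # expand via directly density reachable points
            j = queue.pop()
            if labels[j] == -1:      # border object absorbed into the cluster
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbors(j)
            if len(jn) >= min_pts:   # only core objects propagate reachability
                queue.extend(jn)
        cluster += 1
    return labels
```

Note that the number of clusters is never specified in advance, and isolated points end up labelled -1, matching the Noise definition above.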
In summary, the basic idea of the DBSCAN algorithm is that for each object in a cluster, the number of objects in its neighborhood with the given radius must be greater than a given value; in other words, the neighborhood density must be greater than a certain threshold MinPts. From the above nine definitions, it can be seen that the DBSCAN algorithm is based on the fact that any core object determines a unique class [174]. In brief, the algorithm process is as follows [171] [173]: firstly, a core object is searched for in the data set. Once a core object is found, a cluster containing this core object is generated. Next, DBSCAN searches for directly density reachable objects from these core objects by loop iteration, during which some density reachable clusters may be merged. Finally, the termination condition of the algorithm is that no new sample point can be added to any cluster. In addition, the density-based clustering algorithm DBSCAN is suitable for processing large-scale text sets without specifying the number of classes. It also has a high processing speed, handles noise points effectively, and discovers clusters of arbitrary shape [171] [175].

4.2.2.2 The Relationship between the Optimal Model and the Topic Similarity

The LDA model can extract the latent topic structure of a data set by analyzing a large amount of statistical data, and each data set has an optimal structure. Thus, when the average similarity among topics is smallest, the model is optimal. The following is the relevant proof in theory.

The distribution of topic $z_i$ over the $V$-dimensional word space in the matrix $\varphi$, $p(w_v \mid z_i)$, is used as the topic vector, and the relevance among topic vectors is measured by the standard cosine distance of vectors:
$$corre(z_i, z_j) = \frac{\sum_{v=0}^{V} \varphi_{iv}\, \varphi_{jv}}{\sqrt{\sum_{v=0}^{V} \varphi_{iv}^2}\, \sqrt{\sum_{v=0}^{V} \varphi_{jv}^2}} \qquad (4.3)$$

The smaller the value of $corre(z_i, z_j)$, the more independent the two topics are. The average similarity among all the topics is then used to measure the stability of the topic structure:

$$avg\_corre(structure) = \frac{\sum_{i=1}^{T-1} \sum_{j=i+1}^{T} corre(z_i, z_j)}{T(T-1)/2} \qquad (4.4)$$

$$\arg\min\big(avg\_corre(structure)\big) = \arg\min \sum_{i=1}^{T-1} \sum_{j=i+1}^{T} \frac{\sum_{v=0}^{V} \varphi_{iv}\, \varphi_{jv}}{\sqrt{\sum_{v=0}^{V} \varphi_{iv}^2}\, \sqrt{\sum_{v=0}^{V} \varphi_{jv}^2}} \qquad (4.5)$$

According to Bayes' theorem [82], $p(w_n \mid z_n) \propto p(z_n \mid w_n)\, p(w_n)$. Thus, Equation (4.5) can be converted into a function of $p(z_i \mid w_v)$. Combined with the constraint condition of the LDA model, $\sum_{i=1}^{T} p(z_i \mid w_v) = 1$, the above equation meets the condition when every $w_v$ has an obvious peak on some $z_i$. The condition can be further expressed as:

$$\max \sum_{i=1}^{T} p(z_i \mid w_v)^2 \qquad (4.6)$$

Meanwhile, the optimization function in the process of the EM algorithm iteratively computing the model parameters $\alpha$ and $\beta$ can be stated as Equation (4.7):

$$(\alpha^*, \beta^*) = \arg\max_{\alpha, \beta}\, p(D \mid \alpha, \beta) \qquad (4.7)$$
where, for each document of $N$ words,

$$p(\theta, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} \sum_{i=1}^{T} p(z_n = i \mid \theta)\, p(w_n \mid z_n = i, \beta) \qquad (4.8)$$

Blei et al. used a variational method for approximate inference [32], in which two variational parameters $\gamma$ and $\phi$ were brought in. Equations (4.9) and (4.10) can be obtained as follows:

$$\gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni} \qquad (4.9)$$

$$\beta_{ij} \propto \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi_{dni}\, w_{dn}^{j} \qquad (4.10)$$

where $\phi_{ni}$ stands for the probability $p(z_i \mid w_n)$ that the $n$-th word is generated by topic $z_i$. Substituting Equations (4.9) and (4.10) into Equation (4.8), it can be deduced that Equation (4.8) satisfies the following relationship:

$$p(\theta, w \mid \alpha, \beta) \propto \prod_{n=1}^{N} \sum_{i=1}^{T} (\phi_{ni})^2 \qquad (4.11)$$

Obviously, when Equation (4.6) is satisfied, $p(\theta, w \mid \alpha, \beta)$ reaches its maximum. According to Equation (4.7), this meets the condition of the optimal model. Therefore, it is proved that in the LDA model, when the average similarity among topics is smallest, the model is optimal.

4.2.2.3 A Method of Selecting the Optimal Number of Topics Based on DBSCAN

With the standard method of Bayesian statistics of Section 4.2.1.2, the optimal number of topics in LDA can be found only among a given set of candidate numbers of topics [82]. From the theoretical proof in Section 4.2.2.2, the relationship between the optimal LDA model and the topic relevance is obtained: when the average similarity of the topic structure is smallest, the corresponding model is optimal. Using this as a
constraint condition, the selection of the optimal number of topics $T$ and the parameter estimation of the LDA model are combined in one framework. Following the idea of calculating the density of samples to measure the correlation among topics in density-based clustering, a method of selecting the optimal number of topics based on DBSCAN is proposed in this chapter. Its basic idea is to calculate the density of each topic and find the most unstable topic under the known structure; this is then iterated until the model is stable. In this way, the optimal number of topics in the LDA model can be found automatically without giving a number of topics in advance. The effectiveness of the proposed method will be verified by experiment in Section 4.2.3.

To begin with, according to the basic idea and concepts of DBSCAN [171] [172] [173], three definitions are made as follows:

Topic density: a topic $Z$ and a distance $R$ are given. Draw a circle with centre $Z$ and radius $R$. According to Equation (4.3), the similarity between topic $Z$ and each other topic is calculated. The number of topics whose similarity values fall within the circle is called the density of $Z$ based on $R$, written as Density(Z, R).

Model parameter: a topic model $M$ and a non-negative integer $n$ are given. The number of topics in the model whose density is less than or equal to $n$ is called the base of the model, written as Parameter(M, n).

Reference sample: for a point $z$ in the topic distribution, a radius $r$ and a threshold $m$, if Density(z, r) ≥ m, the word-space vector representing $z$ is called a reference sample of topic $z$. A reference sample is not a real document vector in the data set; it is a virtual point in the spatial distribution of words.

On the basis of the above definitions, the processing steps of the method of selecting the optimal number of topics based on density are as follows:
Step 1: Given any initial value of K, the complete statistical matrix of the Dirichlet distribution is initialized by random sampling to obtain an initial model LDA(α, β).

Step 2: Taking the topic distribution matrix of the initial model LDA(α, β) as the result of the initial clustering, the similarity matrix among all the topics and the average similarity r = avg_corre(φ) are calculated according to Equation (4.3). Based on r, the density of every topic is obtained, represented as Density(Z, r). According to the second definition, let n = 0; the parameter P of this model M can then be calculated, expressed as P = Parameter(M, 0).

Step 3: According to the reference value of K in Step 2, the model parameters of LDA are re-estimated. The update function of K is as follows:

K_{n+1} = K_n + f(r) · (K_n − P_n)    (4.12)

In Equation (4.12), f(r) stands for the direction of change of r. When it is opposite to the previous direction, f_{n+1}(r) = −f_n(r); when it is the same as the previous direction, f_{n+1}(r) = f_n(r). Let f_0(r) = −1. When f(r) = −1, the topics are sorted by density from small to large. The first P topics are regarded as reference samples, and the complete statistical matrix of the next LDA model parameter estimation is initialized accordingly.

Steps 2 and 3 are repeated until the average similarity r and the parameter K converge at the same time. Equation (4.12) shows that the condition of K convergence is arg min(K_n − P_n). From the definition of the model parameter, P increases as the average similarity r decreases, and P ≤ K. When r is the minimum, P is the maximum. Hence, r and K are guaranteed to converge at the same time. At this moment, the obtained value of K is the optimal number of topics in the LDA model.
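The three definitions and the update rule of Equation (4.12) can be sketched as follows. This is a minimal sketch, not the thesis implementation: the similarity-matrix input and the helper names are illustrative assumptions.

```python
def topic_density(sim, z, r):
    # Definition 1, Density(Z, R): number of other topics whose
    # similarity to topic z falls within radius r.
    return sum(1 for j in range(len(sim)) if j != z and sim[z][j] >= r)

def model_parameter(sim, r, n):
    # Definition 2, Parameter(M, n): number of topics whose density
    # is less than or equal to n.
    return sum(1 for z in range(len(sim)) if topic_density(sim, z, r) <= n)

def next_k(k, p, f):
    # Update rule of Equation (4.12): K_{n+1} = K_n + f(r) * (K_n - P_n),
    # where f is +1 or -1 depending on the direction of change of r.
    return k + f * (k - p)
```

On a toy 4-topic similarity matrix in which topics 0 and 1 are near-duplicates, two topics have density 0, so P = Parameter(M, 0) = 2 and one update with f = −1 shrinks K from 4 to 2.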
4.2.3 Experiment and Result Analysis

In this chapter, the experiment environment, data set and the related data preprocessing are the same as in the previous chapter. The basic idea of this experiment is first to use the standard method of Bayesian statistics in Section 4.2.1.2 to find the optimal number of topics T in the LDA model. The value of T is then regarded as a reference value. Finally, the optimal number of topics obtained by the proposed density-based method is compared with the reference value to verify the effectiveness of the proposed method.

Let α = 50/T and β = 0.01 [165], and let T take different values: 20, 40, 60, 80, 100, 120, 140, 160, 180 and 200. For each value of T, the Gibbs sampling algorithm is run and the change in the value of log p(w|T) is observed. The experimental result is shown in Figure 4.2 below.

[Figure 4.2: curve of log p(w|T) over T = 20 to 200; y-axis from −18,500,000 to −15,000,000; x-axis: Number of Topics in LDA (T).]

Figure 4.2: The relationship between log p(w|T) and T.
It can be seen from Figure 4.2 that when the number of topics T is 120, the value of log p(w|T) is the minimum; after that, it begins to increase sharply. This result shows that when the number of topics T is 120, the model fits the useful information of the corpus best. Therefore, the number of topics was set to 120 in the experiment in Chapter 3. In other words, the optimal number of topics in the LDA model is 120, which serves as a reference value to prove that the proposed density-based method in this chapter is effective and feasible.

The density-based method of selecting the optimal number of topics is then applied. T is initialized with seven different values for seven groups of tests: 10, 50, 100, 200, 300, 400 and 500. The results are shown in Table 4.1. It can be seen from Table 4.1 that the proposed method converges to values close to the optimal T of 120 after several iterations. In addition, the closer the initial value is to the optimal value of T, the smaller the number of iterations.

Table 4.1: Results of the proposed algorithm to find the optimal value of topics.

Initial value of topics   Optimal value found   Iterations
10                        118                   19
50                        117                   27
100                       125                   6
200                       124                   14
300                       123                   25
400                       126                   31
500                       124                   38

In the standard method of Bayesian statistics, the number of topics T has to be
assigned a series of values manually. After running the Gibbs sampling algorithm for the LDA model, the corresponding values of log p(w|T) are obtained in sequence so that the optimal number of topics can be determined. However, the given values of T are not universal, and all the values of log p(w|T) have to be calculated. Therefore, its processing time is long and its efficiency is poor. The DBSCAN-based method of selecting the optimal number of topics can find the optimal number of topics in the LDA model automatically with relatively few iterations, and does not need the number of topics to be assigned manually. It can be seen from the above experiment that the proposed density-based method is able to find the optimal number of topics in the LDA model. Therefore, it is effective and feasible.

4.3 Noisy Data Reduction

4.3.1 The Noise Problem in Text Classification

In order to deal with huge amounts of text information and improve the efficiency of classification, automatic text classification based on statistical theory and machine learning has gradually replaced manual text classification. Generally, automatic text classification is divided into two stages. In the first stage, a classifier is constructed from the training text set with category labels. In the second stage, new documents are classified by the constructed classifier. As can be seen, the quality of the classifier influences the final result of text classification directly, and it depends on the quality of the training text set to a great extent. Generally speaking, the more accurate the classes of the training text set and the more comprehensive its content, the higher the quality of the obtained classifier [89]. But in the actual application of text classification, it is difficult to provide a training
text set of high quality and precise classification in advance [91]. Some coarsely classified data is often used as training text directly, such as webpage data collated according to predefined classes and papers which are divided into various classes in journals. These data have class features, which means that the data in each class share some common features. But in the text sets of the various classes, there are a certain number of documents which have wrong class labels; in other words, the content of the documents does not match the tagged classes. These documents with wrong class labels are called noise data in text classification. If coarsely classified text data with noise is used to train the classifier, the accuracy of the classification result will be influenced [90].

Generally, the noise data can be divided into two categories [89]: characteristic noise and category noise. In a text classification system, all the categories are usually on the same level, which means they are in the same plane of the class space. When the classification operation is executed, the text to be classified is compared with all of the categories and then assigned to the appropriate categories. In the field of text classification, the characteristic noise of the training text set usually does not need to be considered. The influence of category noise on the classification result is mainly reflected in the following two aspects:

In the process of classification, except for the classifier which is really associated with the text to be classified, the output of the other classifiers is noise information which affects the final classification result. It is difficult to correct a classification error which is caused by the classification noise. In the practical application of text classification, a large number of classes and large ambiguity among classes make this kind of noise influence the classification performance more seriously.

In the method of plane classification, when the number of categories increases, the processing time of the classifier and the computational cost increase. For
either the single-classification system, where the classifier only outputs the single most relevant class, or the multi-classification system, where the classifier can output multiple related classes, the number of categories which are really related to the text to be classified is limited. In fact, a lot of processing time and memory space is spent on dealing with category noise, which increases the cost of the classifier and degrades the response time and classification effectiveness.

In addition, there are two kinds of sources of category noise [91]. The first is the inconsistent sample; for example, the same sample appears in different categories. The other is the wrong label; for instance, training samples are classified manually with wrong labels. The latter often happens in the classification of large-scale training sets, so the research in this chapter only focuses on the second form of category noise.

4.3.2 The Current Main LDA-based Methods of Noise Processing

4.3.2.1 Data Smoothing Based on the Generation Process of the Topic Model

In the classical LDA model, each text has its own independent topic distribution. In text classification, assume that the documents in the same class have the same probability distribution of topics. Thus, the construction process of the training text set can be regarded as a generation process of a topic model. Each class in the text set has a corresponding latent probability topic model. For the text collection corresponding to each class in the training set, a probability topic model can be obtained by a topic extraction algorithm. A new text generated by the topic model still belongs to the class corresponding to this model [92]. To sum up, if the word token is kept, Equation (3.10) and Equation (3.11) in Section 3.3.1 in Chapter 3 can be changed to:
φ'(z = j, w) = (C^{WT}_{wj} + β) / (∑_{k=1}^{W} C^{WT}_{kj} + W·β)    (4.13)

θ'_j = (∑_{d=1}^{D} C^{DT}_{dj} + α) / (∑_{d=1}^{D} ∑_{k=1}^{T} C^{DT}_{dk} + T·α)    (4.14)

where W stands for the number of all the words in the text set, T represents the number of all the topics in the text set, and D means the number of all the documents in the text set. C^{WT} and C^{DT} are respectively the count matrices of size W × T and D × T. C^{WT}_{wj} stands for the frequency of word w being assigned to topic j, and ∑_{k=1}^{W} C^{WT}_{kj} represents the frequency of all words being assigned to topic j. ∑_{d=1}^{D} C^{DT}_{dj} means the frequency of words in all the texts being labelled with topic j, and ∑_{d=1}^{D} ∑_{k=1}^{T} C^{DT}_{dk} shows the total number of times words in all the texts are labelled with topics.

Therefore, as long as the probability topic models corresponding to each class can be extracted from the training text set, the whole training text set can be regenerated by these models. The regenerated new text obeys the probability distribution of the real data, which can be widely used in smoothing noise and other fields [92] [176] [177]. The process of the algorithm which extracts the LDA model from the text collection of a single category is as follows:

Step 1: All the words in the text set are counted to construct the vector t = (t_1, t_2, …, t_N), where N is the total number of words.

Step 2: The word index WS and document index DS of each word are established. WS_i stands for the word index of the i-th word token. DS_i represents the document index of the i-th word token, which means that the i-th word token is from the DS_i-th document.
Step 3: The topic label vector z of the word tokens is initialized randomly; for example, z_i shows that the i-th word token is labelled with the z_i-th topic. C^{WT} and C^{DT} are updated at the same time, and a vector zt is used to record the number of occurrences of each topic.

Step 4: For each word token w_i, let the corresponding values of C^{WT}, C^{DT} and zt decrease by 1, where i ∈ {1, 2, …, N}.

Step 5: According to Equation (3.9) in Section 3.3.1 in Chapter 3, the probability p(z_i = j | z_{−i}, w) of word token w_i belonging to each topic is computed, where j = 1, 2, …, T.

Step 6: According to the value of p(z_i = j | z_{−i}, w), namely a new probability distribution over topics, a topic j is chosen randomly as the new topic of word token w_i, and all the word tokens in the text set are processed in turn. Besides, C^{WT}, C^{DT} and zt are updated at the same time, and their corresponding values increase by 1.

Step 7: Steps 4, 5 and 6 are repeated for all the word tokens in the text set.

Step 8: Steps 4, 5, 6 and 7 form one computation period of Gibbs sampling. The process is repeated until the sampling result reaches convergence or the maximum number of iterations is reached.

Step 9: According to Equations (4.13) and (4.14), the approximate solutions φ' and θ' of φ and θ can be obtained.
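Steps 4 to 7 above form one sweep of collapsed Gibbs sampling. The following is a minimal sketch under stated assumptions: the count matrices C^{WT} and C^{DT} are kept as nested lists, and the standard collapsed-Gibbs sampling weights are used in place of Equation (3.9), which is not reproduced here.

```python
import random

def gibbs_pass(ws, ds, z, cwt, cdt, zt, T, alpha, beta):
    # One sweep over all word tokens (Steps 4-6): remove the token's
    # current assignment from the counts, compute p(z_i = j | z_-i, w)
    # up to a constant, resample a topic, and add the counts back.
    W = len(cwt)  # vocabulary size; cwt is W x T, cdt is D x T
    for i, (w, d) in enumerate(zip(ws, ds)):
        j = z[i]
        cwt[w][j] -= 1; cdt[d][j] -= 1; zt[j] -= 1          # Step 4
        weights = [(cwt[w][t] + beta) / (zt[t] + W * beta) *
                   (cdt[d][t] + alpha) for t in range(T)]    # Step 5
        j = random.choices(range(T), weights=weights)[0]     # Step 6
        z[i] = j
        cwt[w][j] += 1; cdt[d][j] += 1; zt[j] += 1
    return z
```

Repeating the sweep until convergence (Step 8) and then normalizing the counts per Equations (4.13) and (4.14) yields φ' and θ'.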
Step 10: The LDA model with φ' and θ' is output and stored on the local hard disk.

Usually, the extracted LDA model contains a large amount of data and takes up considerable memory space. If there are many classes in the training text set, the LDA models are not suitable to be kept in memory. So in Step 10, the LDA model is stored on the local hard disk. When the classes of the training text set are processed, each model is loaded into memory in turn so as to save memory space.

After finishing the extraction of the LDA model of a class, the model can be used to generate text which belongs to the class. In brief, the specific process of the LDA model generating a text d is as follows [92]:

Step 1: The LDA model is loaded from the stored file so that the approximate solutions φ' and θ' can be obtained.

Step 2: The average length N_d of all the documents which belong to the class in the training text set is calculated.

Step 3: Let the length of the new text d be l_d = 0, and d = Ø.

Step 4: According to the topic probability distribution θ' of the text, a topic j is selected randomly as the topic which the current word belongs to.

Step 5: According to the word probability distribution φ'(z = j) on topic j, a word w is chosen randomly as a new word in the new text. Let d = d ∪ {w} and increase l_d by 1.

Step 6: If l_d < N_d, return to Step 4; namely, repeat Steps 4 and 5 until l_d ≥ N_d.

Step 7: The document d is output.

In general, in a text set with noise samples, the number of noise samples in each class is small relative to the number of normal samples in the same class [177]. The LDA model obtained by the Gibbs sampling algorithm can therefore largely reflect the correct semantic information of the class. Although it is influenced by noise samples, the newly generated text is basically close to the class. Based on this point, for each category of the training text set, the corresponding LDA model can be obtained by the
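The generation procedure above can be sketched as follows; the `vocab` argument and the flat word-list representation of a document are illustrative assumptions.

```python
import random

def generate_document(theta, phi, vocab, length):
    # Steps 4-6: sample a topic j from the class-level topic
    # distribution theta', then a word from that topic's word
    # distribution phi'[j], until the average length is reached.
    doc = []
    for _ in range(length):
        j = random.choices(range(len(theta)), weights=theta)[0]
        w = random.choices(range(len(vocab)), weights=phi[j])[0]
        doc.append(vocab[w])
    return doc
```

For example, with a degenerate model theta = [1.0, 0.0] and phi[0] putting all mass on the second vocabulary word, every generated document repeats that word.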
Gibbs sampling algorithm. Then, the obtained LDA model is used to regenerate all samples of the corresponding class, which replace the original samples as the new training samples. Finally, this method achieves data smoothing and weakens the influence of noise to some extent [92] [95].

[Figure 4.3: the original data (Class 1 and Class 2, with mislabelled samples at location A) compared with the data regenerated by LDA.]

Figure 4.3: The data smoothing based on LDA weakens the influence of noise samples.

For example, there are two classes at location A in the original training data in Figure 4.3. Some samples of class 2 are incorrectly labelled as class 1. If they are processed without data smoothing, the noise samples will influence the obtained classifier, and some samples with obvious class features may be sorted into wrong classes. If the training set is processed by data smoothing based on the generative process of the LDA model, the quality of the regenerated training data is significantly higher than that of the original data with noise.

4.3.2.2 The Category Entropy Based on the LDA Model

The concept of entropy comes from thermodynamics, where entropy is defined as the logarithm of the number of available system states, which is called thermal entropy. It is a physical quantity which can express the degree of molecular
state disorder. Information entropy was first proposed by the American electrical engineer C. E. Shannon in 1948 [178], which brought the thermodynamic entropy into information theory. Studies have shown that thermal entropy and information entropy are connected in quantity, and that the signs of information entropy and thermodynamic entropy should be opposite [94].

Information entropy can measure the uncertainty or the amount of information of a random event. For example, the size of the uncertainty of a random event can be described by its probability distribution function, and the amount of information can be expressed by the size of the eliminated uncertainty. If a system is more ordered, the information entropy is smaller. Conversely, if a system is more chaotic, the information entropy is higher. Therefore, information entropy is also a measurement of the degree of system ordering.

Suppose a discrete random event X has n states x_1, x_2, …, x_n. Their corresponding prior probabilities are p_1, p_2, …, p_n, where p_i = p(X = x_i) ≥ 0 and ∑_{i=1}^{n} p_i = 1. The information entropy H(X) satisfies the following three conditions [178]:

Continuity: H(p_1, p_2, …, p_n) is a continuous function of p_i, where i = 1, 2, …, n.

Symmetry: H(p_1, p_2, …, p_n) has nothing to do with the ordering of p_1, p_2, …, p_n.

Additivity: if p_n = Q_1 + Q_2 and Q_1 ≥ 0, Q_2 ≥ 0, then

H(p_1, p_2, …, p_{n−1}, Q_1, Q_2) = H(p_1, p_2, …, p_n) + p_n H(Q_1/p_n, Q_2/p_n)
The unique expression of the information entropy H is:

H(p_1, p_2, …, p_n) = −C ∑_{i=1}^{n} p(x_i) log_a p(x_i)    (4.15)

where C is a positive constant. In general, C = 1, and Equation (4.15) becomes Equation (4.16):

H(p_1, p_2, …, p_n) = −∑_{i=1}^{n} p(x_i) log_a p(x_i)    (4.16)

Equation (4.16) is the most basic expression of information entropy, where a is related to the unit of information entropy. When a = 2, the unit of entropy is the bit. When a = e, the unit of entropy is the nat. When a = 10, the unit of entropy is the dit [178].

There are always some documents in the training text set for which, when they are processed by a classifier, it is not definite which class they should belong to, because they may be noise, or may be normal samples which lie on the boundary of classification. If they are the latter, they can be used to improve the accuracy of the classifiers. However, the possibility of these samples being normal boundary samples is very small in a noisy environment; if they are kept, there is a considerable risk. Therefore, a method of category entropy based on the LDA model can eliminate noise samples effectively [179] [180].

If we assume that the training text set contains n categories C = {C_1, C_2, …, C_i, …, C_n}, the training set can be divided into n subsets D = {D_1, D_2, …, D_i, …, D_n} according to the categories. The corresponding LDA models M = {M_1, M_2, …, M_i, …, M_n} are extracted by the Gibbs sampling algorithm from these subsets. For each document in the training set, the similarity
between document d and model M_i can be used to represent the likelihood of d belonging to class C_i:

P(d, M_i) = ∏_{w=1}^{W_d} ∑_{t=1}^{T} θ^{(t)}_{M_i} φ^{(t)}_{M_i}(w)    (4.17)

sim(d, M_i) = P(d, M_i) / ∑_{j=1}^{n} P(d, M_j)    (4.18)

where W_d stands for the number of all the words in document d, θ^{(t)}_{M_i} means the probability of topic t appearing in model M_i, and φ^{(t)}_{M_i}(w) shows the probability of word w, which belongs to topic t, appearing in model M_i.

Obviously, if the similarity between document d and a model M_i is very high but the similarity between d and the other models is quite low, the possibility of d belonging to the class corresponding to model M_i is very large [94]. On the other hand, if the similarity between document d and every model is similar, it is difficult to determine which class d belongs to. According to the concept of information entropy, the category entropy of each training sample is calculated. Then, the samples with uncertain categories are removed by comparison of their category entropy [178]. The category entropy of document d can be calculated by the following equation:

H(d) = −∑_{i=1}^{n} sim(d, M_i) lg(sim(d, M_i))    (4.19)

As can be seen from Equation (4.19), the more indeterminate the category of document d is, the higher the corresponding category entropy H(d). Because the uncertainty of the category of noise samples is larger, their corresponding category entropy is higher than that of the normal samples [181]. Thus, the category entropy of all the
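Equations (4.18) and (4.19) can be computed directly from the per-class likelihoods P(d, M_i). A minimal sketch, using the base-10 logarithm to match the lg in Equation (4.19):

```python
import math

def category_entropy(likelihoods):
    # likelihoods[i] = P(d, M_i) for each class model M_i.
    # Equation (4.18): normalize into similarities sim(d, M_i).
    total = sum(likelihoods)
    sims = [p / total for p in likelihoods]
    # Equation (4.19): entropy of the similarity distribution.
    # Zero-similarity terms contribute nothing (0 * lg 0 -> 0).
    return -sum(s * math.log10(s) for s in sims if s > 0)
```

A document matching one model only has entropy 0, while a document equally similar to all n models has the maximum entropy lg n, which is why high-entropy samples are treated as likely noise.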
samples in the training text set is calculated first. Then, all the samples are sorted by their category entropy. Finally, the samples with the highest values of category entropy are removed according to a certain ratio or a threshold, abandoning most of the noise samples effectively [93].

4.3.3 A Noise Data Reduction Scheme

At present, major studies have focused on classification algorithm research, such as feature selection, selecting the training text and the combination of algorithms, to improve the performance of the classifier [182]. But the algorithms for improving the quality of training data are neglected. Although the above two methods of noise processing based on the LDA model can remove most of the noise samples in the text set effectively, they cannot remove all the noise completely [93] [94]. In addition, some normal samples are inevitably deleted as noise by mistake [96]. Therefore, a noise data reduction scheme is proposed here in the thesis, which combines the method of data smoothing and the method of noise identification based on category entropy. The specific steps of the algorithm are described as follows:

Input: the original training text set D, the ratio of noise reduction q and the maximum number of iterations K.

Output: the final training text set D_f.

Step 1: The original training text set is duplicated as the current training text set D_1.

Step 2: From the current training text set D_1, the LDA models M = {M_1, M_2, …, M_n} are extracted for all the classes C = {C_1, C_2, …, C_n} by the Gibbs sampling algorithm.
Step 3: For each sample d_i ∈ D_1, the category entropy H(d_i) is calculated according to the models M.

Step 4: The obtained category entropy values are sorted. Then, the samples with the top values of category entropy in D_1 are removed, yielding the training text set D_2 after noise reduction, where the number of removed samples is q·|D_1|.

Step 5: From the training text set D_2, the new LDA models M^1 = {M^1_1, M^1_2, …, M^1_n} are extracted by the Gibbs sampling algorithm.

Step 6: From the training text set D_2, new documents are generated by the models M^1 = {M^1_1, M^1_2, …, M^1_n} for all the classes C = {C_1, C_2, …, C_n}, where the number of new documents in each class is the same as the number of documents in that class in D_1. The training text set D_3 is thus obtained after data smoothing.

Step 7: If the current number of iterations k reaches the maximum number of iterations K, the algorithm goes on to the next step. Otherwise, let D_1 = D_3 and return to Step 2.

Step 8: At the end, the final training text set D_f is obtained as D_f = D_3.
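The control flow of Steps 1 to 8 can be sketched as follows. This is a sketch, not the thesis implementation: `fit_lda`, `entropy` and `regenerate` are hypothetical callables standing in for the Gibbs sampling, category-entropy and text-generation routines described above.

```python
def noise_reduction(D, q, K, fit_lda, entropy, regenerate):
    D1 = list(D)                                            # Step 1
    for _ in range(K):                                      # Steps 2-7
        models = fit_lda(D1)                                # Step 2
        ranked = sorted(D1, key=lambda d: entropy(d, models))  # Steps 3-4
        keep = len(D1) - int(q * len(D1))
        D2 = ranked[:keep]              # drop the top-entropy samples
        models = fit_lda(D2)                                # Step 5
        D1 = regenerate(models, len(D1))  # Step 6: smoothing, same size
    return D1                                               # Step 8: D_f
```

Note that the regenerated set keeps the original size |D_1|, which is the property exploited in Section 4.3.4 when the scheme is compared against plain noise removal.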
Figure 4.4: The flow diagram of the noise data reduction scheme.

The algorithm flow chart in Figure 4.4 summarizes the above process briefly. In Step 5, most of the noise samples have already been removed by the noise processing based on category entropy, improving the quality of the training text set. Therefore, M^1 is closer to the real topic model than M. In Step 6, the new training text set is generated by M^1 so that the quality of the training text set is improved further, where the number of generated texts is the same as the size of the original training text set. At this point, one iteration of the algorithm is complete. However, the training text set is usually still influenced by noise samples after one iteration, and there is room for improving its quality further. Thus, the smoothed text set becomes the initial training data of the next iteration.
After the iterations, the newly obtained text set is the final training text set. The whole process needs K iterations. Then, a new classifier can be constructed as the optimal classifier for text classification according to an appropriate classification algorithm. Usually, 3 iterations are enough for the algorithm to achieve the optimal state; if the number of iterations is too large, the effect is reduced because of information loss. The reason and details are given in Section 4.3.4.

4.3.4 The Experiment and Result Analysis

In this chapter, the experiment environment, experiment data set and related data preprocessing are the same as in the experiment in Chapter 3. As shown in Table 4.2, the texts from class 1 and class 2 in Table 3.1 are mixed as the first group of data. The texts from class 3, class 4 and class 5 in Table 3.1 are mixed as the second group of data. Finally, the texts from all five classes in Table 3.1 are mixed as the third group of data. The training set and test set are still divided according to the ratio of 3:2, and SVM is chosen as the classification algorithm, as in the experiment in Chapter 3.

Table 4.2: Experimental data sets.

Group of data   Classes                                              Number of texts
1               comp.graphics, comp.os.ms-windows.misc               1958
2               comp.sys.ibm.pc.hardware, comp.sys.mac.hardware,     2932
                comp.windows.x
3               comp.graphics, comp.os.ms-windows.misc,              4890
                comp.sys.ibm.pc.hardware, comp.sys.mac.hardware,
                comp.windows.x
In addition, noise is added to the training text set in this chapter by first randomly selecting text from each class according to a certain ratio; the selected text is then added into a different class in the training text set as noise data. Here, the noise ratio is the proportion of noise data in all the data of the text set. In the experiment, the values of the noise ratio are 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35 and 0.4, so that the influence of different noise ratios on the classification result can be tested as the noise ratio is increased little by little.

First of all, how the number of iterations K in the proposed noise processing algorithm influences the effect of noise processing is tested on the first group of data. Here, the precision is the proportion of correct samples in the text set after removing noise. As shown in Figure 4.5, the precision decreases slowly after more than three iterations at the various noise ratios. The reason is that too many iterations cause some information which is beneficial for classification to be lost after data smoothing, which affects the classification effect. Based on this point, and in order to save computational overhead at the same time, the maximum number of iterations is generally not more than three. Therefore, the number of iterations K is set to 3 in this experiment, and the ratio of noise reduction q is set to 0.4.
[Figure 4.5: precision (0.50–1.00) against the number of iterations (0–10) for noise ratios 0.1, 0.2, 0.3 and 0.4.]

Figure 4.5: The relationship between the number of iterations and the effect of noise processing with different noise ratios.

In order to verify the performance and feasibility of the proposed strategy based on the LDA model, the following four methods are used as four contrast experiments, with SVM chosen as the classification algorithm for all of them. First, the training data has no noise processing at all, which is Test 1. Second, the training data is only processed by the data smoothing method based on the generative process of the topic model, which is Test 2. Third, the training data is only processed by the method of category entropy based on the LDA model, which is Test 3. Fourth, all the noise samples are removed from the training data directly and accurately and all other normal samples are kept fully, which is Test 4. The fourth method is a benchmark reference method; its result is usually the best that an algorithm of noise reduction can achieve.
[Figure 4.6: classification precision against noise ratio (0–0.40) on the first group of data, for NEW and Tests 1–4.]

Figure 4.6: Classification results of different methods with various noise ratios in the first group of data.

Figure 4.6, Figure 4.7 and Figure 4.8 show the classification results of the different methods with various noise ratios, where NEW is the method proposed in Section 4.3.3. In short, except for Test 4, the performance of NEW is significantly better than the other methods in the vast majority of cases, because its classification precision is only weakly affected by the noise ratio and its performance is stable. Sometimes the fluctuation of the classification precision of Test 3 is large on the same data set. The reason may be that it removes the samples with the top values of category entropy according to a certain ratio or threshold. If the threshold is set too large, some normal samples are abandoned mistakenly; on the contrary, if it is set too small, the effect of noise reduction is weakened.
[Figure 4.7: classification precision against noise ratio (0–0.40) on the second group of data, for NEW and Tests 1–4.]

Figure 4.7: Classification results of different methods with various noise ratios in the second group of data.

In Figure 4.6, Figure 4.7 and Figure 4.8, when the noise ratio is low, all the methods of noise reduction can guarantee good classification precision. When the noise ratio is high, the performance of NEW basically remains stable, but the classification precision of Test 1 declines rapidly and the classification precision of Test 3 sometimes fluctuates widely.
[Figure 4.8: classification precision against noise ratio (0–0.40) on the third group of data, for NEW and Tests 1–4.]

Figure 4.8: Classification results of different methods with various noise ratios in the third group of data.

In addition, it can be seen from Figure 4.7 and Figure 4.8 that the classification precision of NEW is sometimes better than that of Test 4. The reason may be that the noise samples in the training text set are removed by Test 4, reducing the size of the training data. The size of the training data in NEW is the same as the size of the original training data, which can fully reflect all the information in the training data and is beneficial to improving the classification precision. Another reason is that there may be some noise samples, or samples which interfere with the classifier, in the original data set; these samples can also be removed by the noise processing in NEW, enhancing the classification precision.
4.4 Summary

This chapter first introduced the two current main methods of selecting the optimal number of topics for the LDA model: the standard method of Bayesian statistics and the method of selecting the optimal number of topics based on HDP. Then, following the idea of calculating the density of samples to measure the correlation among topics based on the density-based clustering algorithm, a method of selecting the optimal number of topics based on DBSCAN was proposed. Its feasibility and validity were proved by the related experiments. Compared with the two current methods, the proposed method can find the optimal number of topics in the LDA model automatically with a relatively small number of iterations, and does not need the number of topics to be assigned manually.

Next, the two current main LDA-based methods for noise processing were introduced: data smoothing based on the generation process of the topic model, and category entropy based on the LDA model. This chapter combined the two methods and proposed a noise data reduction scheme. Its basic idea is first to filter out noise samples according to the method of category entropy based on the LDA model, and then to use the generative process of the LDA topic model to smooth the data. This strategy not only reduces the influence of noise samples on text classification, but also keeps the size of the original training set. The experimental results showed that the proposed method can greatly reduce the noise in the data set. Even when the noise ratio in the training data set is large, its classification precision remains good. In addition, its classification precision is only weakly affected by the noise ratio, and its performance is stable.
Chapter 5 Genetic Algorithm based Static Load Balancing for Parallel Latent Dirichlet Allocation

5.1 Introduction

Topic models such as LDA have become important tools in the field of text mining. It is convenient to query, classify and manage text by latent topic information. However, the traditional stand-alone LDA has difficulty in meeting the real needs of massive data because of the limitations of storage and computing capacity [72] [117]. The development of clustered hardware systems and parallel computing software technology has brought an opportunity to solve this issue. Parallel programming technology can deal well with computing and storage problems in a cluster, so parallelization of LDA is one of the current main methods of processing large-scale data [45] [69]. In addition, load balancing is a common problem in parallel computing. Choosing or designing a reasonable scheduling strategy and related parameters can minimize the synchronization waiting time among the processors in a heterogeneous cluster, improving the efficiency of parallel computing [130] [133].

This chapter begins with a brief introduction of four parallel LDA methods: Mahout's parallelization of LDA, Yahoo's parallel topic model, the algorithm of parallel LDA based on Belief Propagation (BP) and Google's PLDA. Then, the basic structure and working process of MapReduce on the Hadoop parallel computing platform are introduced briefly. The distributed programming model of MapReduce is used to study the implementation of parallelizing AD-LDA. Next, a static load balancing strategy based on a genetic algorithm for PLDA is proposed, which can improve the performance of the parallel program. Finally, the experiments prove that the efficiency of parallel LDA based on the MapReduce framework is higher than the traditional stand-alone model,
and the proposed load balancing strategy is feasible.

5.2 The Current Main Parallelization Methods of LDA

LDA is in essence a serial algorithm, so current parallel approaches inevitably use approximate assumptions to transform it into a parallel LDA algorithm [72] [188]. The basic idea is to make every computer in a cluster model and learn from local files only, and to transfer, exchange and update the required parameters over the network. At present, there are three main parallel programming technologies: the shared memory model, represented by OpenMP [49] [50]; the message passing model, represented by the Message Passing Interface (MPI) [50]; and the data parallel model, represented by MapReduce [52] [66]. They can implement parallel computing of the three LDA model parameter inference methods, which are Variational Bayes (VB) [31] [32], Gibbs Sampling (GS) [107] [108] and Belief Propagation (BP) [106]. Their parallel algorithms are detailed in the following.

5.2.1 Mahout's Parallelization of LDA

In the early phase, research on parallel topic models focused on the algorithm of parallel variational EM [183]. Parallel sampling is carried out by the parallel execution of E-steps and M-steps. Considering that the computational overhead of the E-step is relatively large and easier to parallelize, Nallapati et al. mainly parallelized E-steps to improve the efficiency of the algorithm [184]. Wolfe et al. proposed a variational EM method which can parallelize E-steps and M-steps at the same time, and proved its effect on a series of network topologies [185]. Mahout [186] is one of the open source libraries of machine learning algorithms in the Apache project, which can process large-scale data. Its distributed algorithms are mainly
based on the Hadoop MapReduce architecture. The distributed LDA EM algorithm in [184] is used in Mahout. Its basic idea is to use variational inference and the EM algorithm to approximate the posterior probability of the latent variables. In the E-step, the update task of the variational parameters is assigned to multiple slave nodes. Each slave node calculates the corresponding statistics according to the updated local parameters, and returns the statistics to the master node. Finally, the master node gathers all the statistics and carries out the process of estimating the model parameters in the M-step. The basic process by which Hadoop realizes Mahout's parallelization of LDA is described as follows [184] [187]. Firstly, the input text data should be formatted as key-value pairs, where the key stands for the document label and the value represents the vector made up of words and their frequencies. Then, the formatted data is stored in the Hadoop Distributed File System (HDFS). HDFS can store the input data on different machines, but executes file management operations according to a unified logical address. In the map phase, the E-steps of LDA based on variational inference are carried out in parallel, updating the variational parameters of each document and returning the intermediate results and the normalization terms. In the reduce phase, the intermediate results obtained from the map phase undergo simple vector addition according to their values. At the end, the matrix is normalized to gain the final topic-word probability parameter. Here, it is assumed that the initial parameter of each document is fixed so that it is not necessary to estimate that variational parameter. A logarithmic representation is used for the intermediate results, which converts complex vector multiply-divide operations into simple vector add-subtract operations, reducing the computational overhead.
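The logarithmic trick mentioned above can be illustrated with a small sketch (an illustration only, not Mahout's actual code): normalizing a distribution stored in the log domain replaces division by a normalizer with subtraction of its logarithm, using the standard log-sum-exp computation.

```python
import math

def log_sum_exp(log_vals):
    """Numerically stable log(sum(exp(v))) over a list of log values."""
    m = max(log_vals)
    return m + math.log(sum(math.exp(v - m) for v in log_vals))

def log_normalize(log_vals):
    """Normalize a distribution kept in the log domain.

    Instead of computing p_i / sum_j p_j (a multiply-divide pass),
    we subtract the log normalizer: log p_i - log Z.
    """
    log_z = log_sum_exp(log_vals)
    return [v - log_z for v in log_vals]

# Example: unnormalized log weights of three topics for one word.
log_w = [math.log(2.0), math.log(3.0), math.log(5.0)]
probs = [math.exp(v) for v in log_normalize(log_w)]
# probs is [0.2, 0.3, 0.5] up to floating point error
```

The subtraction of the maximum inside log_sum_exp is what keeps the computation stable when the log weights are large and negative, which is the common case for per-word topic weights.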
5.2.2 Yahoo's Parallel Topic Model

The parallel topic model framework [188] is a parallel framework for LDA proposed by Yahoo Labs [189]. Its core idea is that the global statistics (the topic-word matrix) and the local statistics (the document-topic matrix) are updated at different speeds in the process of Gibbs sampling. The global statistics are therefore updated with a delay, reducing the overhead of passing parameters. Moreover, asynchronous updates are adopted among the different machines, which avoids machines with a fast sampling rate waiting for machines with a slow sampling rate [190]. The framework of Yahoo's parallel topic model divides the parallelization into two parts [188] [190]: single-machine multi-core parallelization and multi-machine parallelization. The basic idea of the single-machine multi-core parallel algorithm is the assumption that a single document, relative to the massive text set, affects the global statistics only weakly. The global statistics can then be updated with a delay after a single document finishes Gibbs sampling, which means that the global statistics are held constant while a single document executes Gibbs sampling. Thus, the process of Gibbs sampling can be decomposed into multiple sub-processes. A large number of threads simultaneously carry out Gibbs sampling in parallel on each word in each document through a pipelined process, decreasing the time overhead. The basic idea of multi-machine parallelization is that local statistics and a shared copy of the global statistics are maintained on each machine. Each machine finishes Gibbs sampling of its local documents according to its local copy of the global statistics, and asynchronously synchronizes the updated data into the shared global statistics of the other machines. Therefore, the update of the global statistics can be completed asynchronously, so that the algorithm can continue with the next iteration of Gibbs sampling without waiting for all the machines to finish the current round of Gibbs sampling and complete the update of the global statistics. This enhances the efficiency of the algorithm's parallelization greatly.
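The delayed-update assumption can be sketched as follows (a toy single-machine illustration, not Yahoo's implementation): the global word-topic counts are frozen while one document is resampled, and the accumulated changes are folded into the global statistics only afterwards.

```python
import random

def resample_document(doc, z, global_wt, doc_topic, n_topics,
                      alpha=0.1, beta=0.01, vocab=10):
    """Resample topic assignments of one document while keeping the
    global word-topic counts frozen (the delayed-update assumption)."""
    delta = {}  # changes to the global counts, applied later
    for i, w in enumerate(doc):
        old = z[i]
        doc_topic[old] -= 1
        weights = []
        for k in range(n_topics):
            # frozen global counts: global_wt is read here, never written
            topic_total = sum(global_wt[k])
            p = ((doc_topic[k] + alpha) * (global_wt[k][w] + beta)
                 / (topic_total + vocab * beta))
            weights.append(p)
        new = random.choices(range(n_topics), weights=weights)[0]
        z[i] = new
        doc_topic[new] += 1
        delta[(old, w)] = delta.get((old, w), 0) - 1
        delta[(new, w)] = delta.get((new, w), 0) + 1
    return delta

def apply_delta(global_wt, delta):
    """Fold one document's accumulated changes into the global statistics."""
    for (k, w), d in delta.items():
        global_wt[k][w] += d

random.seed(0)
global_wt = [[1] * 10 for _ in range(2)]  # word-topic counts, 2 topics
doc = [0, 3, 7, 3]                        # word ids of one document
z = [0, 1, 0, 1]                          # current topic assignments
doc_topic = [2, 2]                        # topic counts of this document
delta = resample_document(doc, z, global_wt, doc_topic, n_topics=2)
apply_delta(global_wt, delta)             # the delayed global update
```

Because every decrement in delta is matched by an increment for the same word, the total word-topic count is conserved, which is why the delayed update does not corrupt the global statistics.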
The basic process of achieving Yahoo's parallel topic model is as follows [188] [191]: the input text data is first formatted as key-value pairs, and the key-value pairs are then stored in HDFS. In single-machine multi-core parallelization, pipelined processing is used to carry out Gibbs sampling on a single document with Intel Threading Building Blocks (Intel TBB) [192], realizing the algorithm. Intel TBB is a multithreaded programming library based on C++, which lets users conveniently write software that takes full advantage of the performance of multi-core processors. In multi-machine parallelization, a corresponding server and client are established on each machine using the Internet Communications Engine (ICE) [193] to transfer the global statistics and update the local copies of the global statistics asynchronously. ICE is a content-oriented distributed tool provided by the ZeroC company, which provides an API for distributed programming. On the client side, the data is stored in a hash table, and Put/Get operation requests are issued to the server side through the mechanism of Asynchronous Method Invocation (AMI). On the server side, the data is stored in a distributed hash table.

5.2.3 The Algorithm of Parallel LDA Based on Belief Propagation

Jia Zeng and Newman used OpenMP and MPI to implement the algorithm of parallel LDA based on BP, which can deal with the bottlenecks of computation time and memory storage on shared-memory systems [72] [106]. In brief, OpenMP is an extensible standard which supports multiple programming languages across platforms, and provides users with a simple and flexible interface [49]. OpenMP uses the Fork-Join parallel model to achieve the parallelization of tasks, and data interaction is realized by accessing the same shared memory address space. MPI is one of the most popular parallel programming mechanisms, used on parallel computers, workstations and clusters with distributed storage. MPI first divides a large-scale computing task into many small parts. Then, the small parts are assigned to
different nodes of the cluster to achieve the parallelization of the task. Finally, the model parameters are updated through MapReduce-style synchronization or Receive-Send asynchronization to achieve the data interaction [50]. The basic idea of the MPI implementation of the parallel BP algorithm is to segment the document-word co-occurrence matrix first. Then, the data is assigned to all nodes of the cluster, and local model parameters are obtained by running the BP algorithm. At the end, the synchronization and updates of the global model parameters are realized by calling the library function MPI_Allreduce in MPI. The basic process of the implementation is that D training documents are assigned to P processors in the cluster evenly, so that each processor gains D/P documents. Next, the parameters are initialized. The BP algorithm is executed on each processor, obtaining the local parameters. After enough iterations, the model converges. Finally, the local statistics are reduced by calling the library function MPI_Allreduce, gaining the global model parameters [50] [106]. The basic idea of the OpenMP implementation of parallel BP is to distribute large-scale data to multiple processors. Each process runs independently but shares the memory address space, so the model parameters are global. Here, the updates of the model parameters are atomized so that access conflicts on the same address can be avoided effectively. The corresponding basic process of the implementation is that the parameters are initialized first. Then, the text set is assigned to multiple threads, which execute the BP algorithm in parallel. The information content of each word in the text handled by a single thread is calculated. Here, the information content carried by each word is atomized so that multiple threads will not read and write the same memory address at the same time, avoiding access conflicts. Next, in order to avoid access conflicts as well, the model parameters are updated and the information content is atomized again. Finally, the above process is iterated until the algorithm converges, obtaining the global model parameters [50] [106].
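The MPI_Allreduce step can be illustrated with a small sketch (a plain-Python emulation, not MPI itself): each node holds a local word-topic count matrix, and the reduction sums them element-wise so that every node can start the next iteration from the same global statistics.

```python
def allreduce_sum(local_matrices):
    """Emulate MPI_Allreduce with the sum operation: every node
    receives the element-wise sum of all local count matrices."""
    rows = len(local_matrices[0])
    cols = len(local_matrices[0][0])
    global_counts = [[sum(m[r][c] for m in local_matrices)
                      for c in range(cols)]
                     for r in range(rows)]
    # in MPI, every rank receives its own copy of the result
    return [[row[:] for row in global_counts] for _ in local_matrices]

# Two nodes, each with a 2-topic x 3-word local count matrix.
node_a = [[1, 0, 2], [0, 1, 1]]
node_b = [[2, 1, 0], [1, 0, 3]]
results = allreduce_sum([node_a, node_b])
# every node now holds [[3, 1, 2], [1, 1, 4]]
```

In the real MPI call, the summation and the broadcast of the result are combined into one collective operation, which is why a single MPI_Allreduce call suffices to synchronize the global model parameters.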
5.2.4 Google's PLDA

Newman et al. proposed two parallel LDA models based on Gibbs sampling: Approximate Distributed LDA (AD-LDA) and Hierarchical Distributed LDA (HD-LDA) [72] [117]. In AD-LDA, each thread calculates the local model parameters, and the global model parameters are then updated by reduction. HD-LDA adds a hyperparameter in the topic-word tier, and many aspects of the performance of the algorithm were analyzed in detail, such as perplexity, convergence and speed-up ratio. Asuncion et al. put forward an asynchronous parallel topic model, and gave implementations of two main algorithms, LDA and HDP [194]. Porteous et al. proposed a fast Gibbs sampling method and the corresponding parallel LDA algorithm [71]. Wang et al. proposed a parallel LDA algorithm based on MPI and MapReduce and compared the features and results of the two implementation methods; in addition, their experiments proved its feasibility and efficiency [195]. PLDA is a parallel LDA algorithm proposed by Google based on AD-LDA [195]. It estimates the posterior distribution of the latent variables by Gibbs sampling, and realizes the parallel algorithm on both the MPI and MapReduce distributed programming platforms. In AD-LDA, the word-topic distribution is initialized randomly, and the word-topic statistics of the whole text set are computed first. Then, the input data is assigned to different processors. Each process executes Gibbs sampling on its assigned part of the text set as if it were the whole text set. After all the processors finish Gibbs sampling, the word-topic distributions of the whole text set are gathered to get the global statistics. Finally, the obtained global statistics become the initial value of the global statistics in the next iteration of Gibbs sampling. In the implementation of PLDA based on MPI [195], each node calls the AllReduce API of MPI after finishing local Gibbs sampling. After the current iteration of Gibbs sampling finishes on all nodes, the change information of the statistics on
each node is transferred to the other nodes and accumulated into the global statistics. Meanwhile, the change information of the statistics on the other nodes and the updated global statistics are received, which are regarded as the initial value of the global statistics in the next iteration of Gibbs sampling. In the MapReduce version, the Hadoop parallel computing platform is used to implement PLDA [195]. In each iteration of Gibbs sampling, each mapper node carries out Gibbs sampling on its local data, which is called the Map phase. The changes of the local statistics and the new topic assigned to each word are recorded. After all mapper nodes finish local Gibbs sampling, the data is transferred among all the nodes, which is called the Shuffle phase. The changes of the local statistics, as the intermediate results, are transferred to the reducer nodes, which is called the Reduce phase. The reducer nodes gather the updates of all mapper nodes and work out the new global statistics, which are regarded as the initial value of the global statistics in the next iteration of Gibbs sampling. After enough iterations, the global model parameters are obtained.

5.2.5 Analysis and Discussion

In Mahout's parallelization of LDA, multiple processors operating on the shared variables at the same time results in read-write conflicts, which makes the efficiency of the parallel variational EM algorithm relatively low. Compared with Mahout's parallelization of LDA and Google's PLDA, Yahoo's parallel topic model can improve the efficiency of parallel computing greatly. However, its basic idea is to assume that updating the global statistics with a delay will not affect the final inference result, which means that the global statistics are treated as constant during the sampling of a single document. Unfortunately, experiments prove that it sacrifices some accuracy when dealing with real data sets.
In fact, when the parallel BP algorithm based on MPI deals with large-scale data, the proportion of communication time on each processor increases with the number of processors. Meanwhile, the memory employed is large. Therefore, there are some limitations for its application in ultra-large-scale topic modeling. In addition, the parallel BP algorithm based on OpenMP works on shared memory, so it occupies only about 1/P of the memory of the algorithm based on MPI, which reduces the memory consumption greatly. Here, P is the number of processors. Besides, it does not incur the reduce communication time among the MPI processes, decreasing the calculation time greatly. However, all the threads run independently but share the memory address space, so multiple threads may carry out read-write operations on the same memory address at the same time, causing memory access conflicts. In addition, the algorithm based on OpenMP can only be used in shared memory systems; namely, its scalability is not strong.

Table 5.1: The features comparison of implementing PLDA with MPI and MapReduce.

Distributed programming model    Communication mode        Fault recovery
MPI                              Through memory/network    Not supported
MapReduce                        Through GFS/disk IO       Built in

Table 5.1 shows the comparison of the features of implementing PLDA with MPI and MapReduce. Without considering computing failures of the processors, the efficiency of MPI-PLDA is higher than that of MapReduce-PLDA, because the data of its iterative calculation is kept and accumulated in memory, and the whole process does not need disk IO. But if the number of processors and the size of the processed data increase, failures in some processors can no longer be ignored. MapReduce has a built-in fault recovery function, but MPI does not support it. Under these circumstances, once a machine fails,
the whole calculation process in MPI-PLDA will be forced to restart. Therefore, this chapter focuses on MapReduce-PLDA, and there will be a detailed study of its algorithm implementation in the next section. In addition, the overhead of disk IO is very large in MapReduce-PLDA, because the map phase needs to record the status of the topics assigned to the words in the text set to the hard disk [117] [195].

5.3 Parallelizing LDA with MapReduce/Hadoop

Based on the above analysis and comparison, the Hadoop distributed platform and the PLDA algorithm are used to study the implementation process of parallel LDA in this chapter. Gibbs sampling in LDA is a serial algorithm in essence, so parallelizing Gibbs sampling rests on the following assumption: there are m processors, and all the data can be divided into m parts. Each processor executes Gibbs sampling on its local data set independently. As long as the global statistics on each processor are updated in time, the final inference result approximates the serial inference result. In other words, the effect of distributed sampling on the final result can be ignored, which has been proved in [188].

5.3.1 The Working Process of MapReduce on Hadoop

At the bottom of the Hadoop parallel platform is HDFS, which stores the files of all the storage nodes in a Hadoop cluster [63] [64]. The tier above HDFS is the MapReduce engine. MapReduce is composed of two user-defined functions, the Map function and the Reduce function. The input and output of the Mapper and Reducer are stored in HDFS. Programmers only need to focus on the specific computing tasks of the Map function and Reduce function; the other complex problems in the parallel calculation, such as partitioning of the data set, job scheduling, fault tolerance and communication among nodes, are handled in the background by the MapReduce runtime [65] [66]. A typical working process of MapReduce is as follows [52] [61] [64] [66]:
Input phase: when a map task runs, the paths of input and output and some other operating parameters should be indicated. The InputFormat class analyzes the input files and splits a large data file into a number of independent data blocks. The data blocks are read in the format of key-value pairs <key1, value1>.

Map phase: a user-defined mapper class can carry out any operation on the key-value pair <key1, value1>. After the operation, the OutputCollector class is called to generate a batch of new intermediate results in the format of key-value pairs <key2, value2>. The type of the key-value pair <key2, value2> can be different from the input key-value pair <key1, value1>.

Shuffle and Sort phase: in order to ensure that the input of Reduce is the sorted output of Map, the intermediate results with the same key are processed by one Reduce as far as possible. In the Shuffle phase, the intermediate results with the same key are gathered onto the same node as far as possible. In the Sort phase, the input of Reduce is sorted by the framework according to the value of the key. Usually, the two phases of Shuffle and Sort are executed in parallel.

Reduce phase: over all the intermediate data, the user-defined Reduce function is applied to each unique key. The key-value pairs with the same key are merged, where the input parameter is <key2, (list of values2)> and the output is new key-value pairs <key3, value3>.

Output phase: the results of Reduce are output to HDFS by calling the OutputCollector class.

5.3.2 The Algorithm and Implementation of PLDA

Every iteration of Gibbs sampling in PLDA [195] is regarded as a MapReduce job in Hadoop. When each data node carries out Gibbs sampling, two matrices are calculated. One is the word-topic count matrix C^word, where each element C_wk represents the frequency of word w being assigned to topic k. Another is the
document-topic count matrix C^doc, where each element C_kj expresses the frequency of the words in document j being assigned to topic k. Briefly, when Hadoop is used to implement PLDA, Gibbs sampling is executed on the local data in the Map phase, and the text distribution, topic distribution and C^word are updated in the Reduce phase. Hadoop assigns the D training texts to P mappers, and each mapper deals with D_p = D/P texts. The corresponding distributions of text W and topic Z are also assigned to the P mappers, which is expressed as {W_1, ..., W_P} and {Z_1, ..., Z_P}. Here, W_p and Z_p stand for the parts that are only processed in mapper p. Thus, the document-topic count matrix C^doc is also distributed over the mappers, represented as C^doc_p, and each mapper keeps its own word-topic count matrix C^word. At this point, the text is the key, and the text distribution and topic distribution are the value. Each mapper executes Gibbs sampling on its local data, which is treated as the whole text set. The mapper updates the corresponding topic distribution of each text, and then outputs the text distribution and topic distribution to the Data Reducer; the text is the key, and the text distribution and topic distribution are the value. After the iteration of Gibbs sampling over all the documents, the mapper outputs the update of C^word to the Model Reducer; the word is the key, and C^word is the value. Therefore, when PLDA is realized on the Hadoop parallel platform, two reducers are used to process and output the updated text distribution, topic distribution and C^word. In the Reduce phase, the output of the map is transferred respectively to the Data Reducer and the Model Reducer, which identify the input key and invoke different reduce algorithms accordingly.
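The two count matrices and the Model Reducer's aggregation can be sketched as follows (a simplified illustration, not Google's PLDA code): each mapper derives its count-matrix update from its local topic assignments, and the Model Reducer sums the per-mapper word-topic updates into the new global C^word.

```python
def count_matrices(docs, assignments, n_topics, vocab_size):
    """Build C^word (word-topic) and C^doc (document-topic) counts
    from the current topic assignments of one mapper's local data."""
    c_word = [[0] * n_topics for _ in range(vocab_size)]
    c_doc = [[0] * n_topics for _ in range(len(docs))]
    for j, (doc, z) in enumerate(zip(docs, assignments)):
        for w, k in zip(doc, z):
            c_word[w][k] += 1  # word w assigned to topic k
            c_doc[j][k] += 1   # a word of document j assigned to topic k
    return c_word, c_doc

def model_reduce(per_mapper_c_word):
    """Model Reducer: sum the word-topic updates from all mappers
    into the new global word-topic count matrix."""
    vocab_size = len(per_mapper_c_word[0])
    n_topics = len(per_mapper_c_word[0][0])
    return [[sum(m[w][k] for m in per_mapper_c_word)
             for k in range(n_topics)]
            for w in range(vocab_size)]

# Two mappers, a vocabulary of 4 words, 2 topics.
docs_a, z_a = [[0, 1, 1]], [[0, 1, 0]]
docs_b, z_b = [[2, 3]], [[1, 1]]
cw_a, _ = count_matrices(docs_a, z_a, n_topics=2, vocab_size=4)
cw_b, _ = count_matrices(docs_b, z_b, n_topics=2, vocab_size=4)
global_cw = model_reduce([cw_a, cw_b])
# global_cw is [[1, 0], [1, 1], [0, 1], [0, 1]]
```

Because the per-word rows of C^word are summed independently, the Model Reducer can process one word (one key) at a time, which is exactly the key-value structure described above.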
Figure 5.1: The framework of one Gibbs sampling iteration in MapReduce-LDA.

As shown in Figure 5.1, the Data Reducer saves the map output, the text distribution and topic distribution, to HDFS after a Gibbs sampling iteration. For each word, the Model Reducer gathers and outputs the updated C^word of that word. After all mappers complete Gibbs sampling, the Data Reducer and Model Reducer aggregate the intermediate results, obtaining the updated global statistics, which are treated as the initial value of the next iteration of Gibbs sampling. The input and output of the MapReduce job in PLDA are the same in terms of format, both containing the text distribution, topic distribution and word-topic count matrix C^word. So the Gibbs sampling iteration of PLDA in Hadoop can be sustained; namely, each output of reduce can be the input of map.

5.4 A Static Load Balancing Strategy Based on a Genetic Algorithm for PLDA in Hadoop

In the algorithm of PLDA, after each data node finishes Gibbs sampling on its local data, data exchange is executed to achieve the synchronization of the parallel algorithm. If the load of each node is imbalanced, the nodes with a small load will wait for the
nodes with a large load, which influences the parallel efficiency. Therefore, this chapter proposes a static load balancing strategy based on a genetic algorithm for PLDA in a heterogeneous cluster. Data partitioning is executed before the start of the task. According to divisible load theory, the synchronization waiting time among data nodes should be reduced as far as possible, enhancing the performance of parallel execution [127] [128] [129] [131] [133].

5.4.1 The Algorithm Design

First of all, in Hadoop, the total time T of a mapper processing a data block includes the following four parts:

Data copying time t_c: the time of a data block being copied from HDFS to a local hard disk. In reality, it depends on the actual network bandwidth and the writing speed of the hard disk.

Data processing time t_p: the computing time of the mapper to process a data block.

Merging time of intermediate data t_m: in the Shuffle phase, the operating time during which the intermediate data output by the mapper with the same key is merged to be transferred to the reducer as its input.

Buffer releasing time t_b: the time of releasing the filled buffer until it is empty.

Therefore,

    T = t_c + t_p + t_m + t_b                                  (5.1)

Then, before the algorithm design, the following assumptions are made: D_s is the size of the data block; H_w is the writing speed of the hard disk, in MB/s;
B_n is the actual network bandwidth, in MB/s; P_s is the speed of the processor running the mapper process, in MB/s; B_s is the size of the buffer of the mapper; R_a is the ratio of the size of the intermediate data to the size of the original data block; N_f is the frequency of processing the intermediate data; N_b is the number of times the buffer is filled up; V_b is the amount of data processed when the buffer is filled up; S_h is the sort factor of Hadoop.

Thus,

    t_c = D_s / min(H_w, B_n)                                  (5.2)

where t_c depends on the actual hard disk and network bandwidth; their speeds affect the time of the data being copied from HDFS to the local hard disk of the mapper. In addition, it is obvious that

    t_p = D_s / P_s                                            (5.3)

While a buffer is being filled, the processor needs to continue writing the intermediate data to the buffer, and meanwhile the releasing process keeps writing the sorted data from the buffer to the hard disk. So, the net speed of filling a buffer can be expressed as P_s * R_a - H_w, and the time of a buffer being filled up is B_s / (P_s * R_a - H_w). As a result, by the time a buffer is filled up, the processor will have generated intermediate results of size V_b, which can be represented as

    V_b = B_s * P_s * R_a / (P_s * R_a - H_w)                  (5.4)

The total amount of intermediate data generated from the original data block
of size D_s can be expressed as D_s * R_a. Therefore, the number of times the buffer is filled up can be expressed as

    N_b = D_s * R_a / V_b                                      (5.5)

If the time of the buffer being released once is B_s / H_w, the time of the buffer being released N_b times is B_s * N_b / H_w. So,

    t_b = B_s * N_b / H_w                                      (5.6)

The frequency of processing the intermediate data, N_f, can be defined as

    N_f = N_b / S_h - 1                                        (5.7)

When the intermediate data begins to be merged, the time of all the intermediate data being written onto the hard disk is D_s * R_a / H_w. If the merging process happens N_f times, the operation time of disk IO is N_f * D_s * R_a / H_w. Therefore,

    t_m = N_f * D_s * R_a / H_w                                (5.8)

In Hadoop MapReduce, the total time of processing the data blocks in one processing wave, T_all, depends on the maximum time consumed by the k participating mappers, as shown in Equation (5.9):

    T_all = max(T_1, T_2, ..., T_k)                            (5.9)

because the mappers that have finished processing have to wait for the other mappers to complete
processing. In order to minimize T_all, according to divisible load theory [132] [133] [134], it is assumed that all the mappers involved in the calculation complete their local data processing at the same time. In other words, this can be expressed as

    T_1 = T_2 = ... = T_k                                      (5.10)

5.4.2 The Design and Implementation of the Genetic Algorithm

A genetic algorithm is a self-adapting, globally optimizing probabilistic search algorithm, which makes the computer simulate genetic and evolutionary behaviour in living nature [122]. The basic idea of this algorithm is that the initial population is composed by coding first; then, according to the fitness function, genetic operators such as selection, crossover and mutation are used in the evolution of the population; finally, the global optimal solution is obtained [197] [198].

5.4.2.1 Encoding Scheme

When solving practical optimization problems, genetic algorithms process chromosomes (individuals) in the form of gene strings. Thus, a genetic algorithm has to convert the solutions of an optimization problem into the form of chromosomes. In fact, the coding process maps the problem space to the chromosome space [199]. Usually, genetic algorithms adopt binary coding. Its advantages are the simple operations of coding and decoding, which are beneficial for implementing the crossover and mutation operations. However, when Hadoop is used to implement PLDA in a heterogeneous computing environment, a huge amount of text data will be processed. In this case,
binary coding makes the length of the chromosome too long and requires too much memory, which influences the performance of the whole computing platform. In addition, binary coding cannot reflect the inherent structure of the problem directly; when it is used to solve high-dimensional optimization problems, the accuracy is not high. In order to overcome the disadvantages of binary coding, this chapter uses decimal coding. It treats the sizes of the data blocks assigned to the mappers as the chromosome, so that the genetic operations can be executed directly.

5.4.2.2 The Initialization of the Population

Usually, the size of the population is set between 20 and 200 [125]. After many experiments, the size of the population is set to 100 in this chapter. After setting the size of the population, the initialization process generates 100 chromosomes randomly. This chapter uses decimal coding rather than indirect coding, so the length of each chromosome and the values of the genes depend on the data blocks assigned by the Hadoop platform to each mapper.

5.4.2.3 Fitness Function

Generally, converting an objective function into a fitness function should follow two principles [120] [122]:

The fitness must be non-negative.

The change direction of the objective function in the optimization process is the same as the change direction of the fitness function in the population evolution process.

According to the algorithm design in Section 5.4.1, the fitness function of the Hadoop-based parallel LDA can be expressed as Equation (5.11).
    f(T) = Σ_{i=1}^{k} (T_i - T̄)²                              (5.11)

where T_i stands for the processing time of the i-th mapper, and T̄ means the average time of the k mappers processing their data, which can be expressed as T̄ = (1/k) Σ_{i=1}^{k} T_i.

As long as the fitness value f(T) of an individual is smaller than a given fitness threshold, which means that the k mappers complete their local data processing at almost the same time, the iteration of the genetic algorithm converges and ends. Otherwise, the newly obtained generation of the population replaces the previous generation, and the genetic operations continue in a loop until iterative convergence. In addition, the maximum number of generations of the algorithm can be set as another iteration termination condition, avoiding the case where the algorithm cannot converge: as soon as the iterations reach the given maximum number of generations, the genetic algorithm is terminated.

5.4.2.4 Crossover

The crossover operator of the genetic operations is the main method of generating new individuals; it determines the global search ability of the genetic algorithm and plays a central role in it. Crossover is an operation in which parts of the gene strings of two parent individuals are exchanged and recombined in some way to generate new individuals [124]. The crossover operator can maintain the features of excellent individuals of the original population to a certain extent, and it also lets the algorithm explore new gene space. So new individuals are generated in the next generation, maintaining the diversity of individuals in the new population [200]. This chapter adopts single-point crossover, which is often used in simple genetic algorithms. The specific operations are as follows: the crossover probability P_c is set
to 0.5 in this chapter. According to this P_c, two individuals are selected randomly from the population as parent individuals. If the length of the gene string is L and a crossover point is determined randomly, there may be L-1 crossover locations. When the chromosomes cross, the coded values of the crossed genes of the two selected individuals are exchanged and two new individuals are generated. But the new individuals may not all be retained in the next generation: they need to be compared with the original parent individuals, and only if the new individuals fit the fitness function better are they kept in the next generation.

5.4.2.5 Mutation

The basic idea of a mutation operator is that the genetic values at some locations of an individual's gene string in the population are changed to generate a new individual. Mutation is a kind of local random search method [200]. When it is combined with a crossover operator, some permanent loss of information caused by the crossover operator can be avoided and the effectiveness of the genetic algorithm can be guaranteed. Besides, it can keep the search from falling into a local optimum [201]. Furthermore, in order to prevent some genes in the population from being stuck in a constant state, the mutation operation is required [122]. It makes the genetic algorithm maintain the diversity of the population, ensuring that the population can continue to evolve and preventing premature convergence [200]. Therefore, mutation in a genetic algorithm is a helper method of generating new individuals, and an effective measure to avoid premature convergence of the genetic algorithm [118] [119]. Crossover needs two parent chromosomes to mate, but mutation only needs one parent chromosome. In general, for each genetic value of the offspring individuals generated in the crossover operation, a pseudo-random number between 0 and 1 is produced; if it is smaller than P_m, the mutation operation is executed. The basic
steps of the simple mutation operation applied in this chapter are as follows:

Step 1: The mutation probability P_m is set to 0.01, to mutate the genetic value of the selected gene.

Step 2: If the individual mutates, the genetic value of its gene is replaced by a new value, generating a new individual.

5.4.2.6 The Optimal Retention Strategy

This chapter adds an optimal retention strategy into the simple genetic algorithm, to prevent the individuals with the highest fitness from being destroyed in the process of crossover and mutation. Before crossover and mutation, the two individuals with the highest fitness in the current generation are set aside and do not participate in the genetic operations. After the individuals of the current generation complete the genetic operations, the two preserved individuals are compared with, and replace, the two individuals with the lowest fitness in the population, which ensures better population evolution and improves the computational efficiency of the genetic algorithm.

5.5 Experiment and Analysis

5.5.1 Evaluating PLDA in Hadoop

In this section, the Hadoop parallel framework is used as the experiment platform. However, because the experimental conditions are limited, in that the cluster is very small and the computing capacity is insufficient, the test data set is not large, but it is real. The purpose of the experiment is to verify the feasibility and effectiveness of the PLDA algorithm and to expose the load balancing problem. So this section does not involve the genetic algorithm based load balancing strategy.
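The genetic algorithm described in Section 5.4.2 can be sketched as follows (a simplified, self-contained illustration under assumed mapper speeds, not the thesis implementation): chromosomes are decimally encoded block-size vectors, the objective follows Equation (5.11), with T_i simplified to block_size_i / speed_i as a stand-in for the full time model of Section 5.4.1, and evolution uses single-point crossover, mutation and elitist retention.

```python
import random

SPEEDS = [2.0, 2.0, 1.0, 0.5]  # assumed heterogeneous mapper speeds (MB/s)
TOTAL = 600.0                  # total data to partition (MB)
K = len(SPEEDS)

def objective(chrom):
    """Equation (5.11): sum of squared deviations of mapper times from
    their mean; T_i is simplified here to block_i / speed_i."""
    times = [b / s for b, s in zip(chrom, SPEEDS)]
    mean = sum(times) / K
    return sum((t - mean) ** 2 for t in times)

def random_chromosome():
    """Decimal encoding: one block size per mapper, summing to TOTAL."""
    cuts = sorted(random.uniform(0, TOTAL) for _ in range(K - 1))
    bounds = [0.0] + cuts + [TOTAL]
    return [bounds[i + 1] - bounds[i] for i in range(K)]

def crossover(a, b):
    """Single-point crossover, rescaled so sizes still sum to TOTAL."""
    p = random.randint(1, K - 1)
    child = a[:p] + b[p:]
    scale = TOTAL / sum(child)
    return [g * scale for g in child]

def mutate(chrom, p_m=0.01):
    """Perturb each gene with probability p_m, then rescale."""
    out = [g * random.uniform(0.8, 1.2) if random.random() < p_m else g
           for g in chrom]
    scale = TOTAL / sum(out)
    return [g * scale for g in out]

def evolve(pop_size=100, generations=200):
    pop = [random_chromosome() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=objective)
        elite = pop[:2]  # the optimal retention strategy
        children = []
        while len(children) < pop_size - 2:
            a, b = random.sample(pop[:pop_size // 2], 2)
            children.append(mutate(crossover(a, b)))
        pop = elite + children
    return min(pop, key=objective)

random.seed(1)
best = evolve()
# the balanced partition assigns block sizes proportional to speed,
# so with these speeds the ideal split is roughly [218, 218, 109, 55]
```

Note that here "best fitness" means the smallest value of the Equation (5.11) objective, matching the convergence threshold described above; the found partition should clearly outperform a naive equal split on a heterogeneous cluster.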
5.5.1.1 The Experimental Environment

Due to the limited experimental conditions, nine ordinary desktop computers are used to constitute a Hadoop cluster in the local area network. The Hadoop platform has a Master/Slave structure, so one of them is used as the master node to do the work of the NameNode and JobTracker, and the other eight machines are slave nodes doing the work of the DataNode and TaskTracker. In the homogeneous environment, each desktop computer has an Intel(R) Core(TM) 2 Quad Q6600 2.40 GHz CPU, 4 GB of RAM and a 500 GB SATA hard disk at 7200 rpm. The heterogeneous environment is based on the homogeneous environment, but the hardware of two slave nodes is changed: their CPU is an Intel(R) Pentium(R) IV 2.8 GHz, their RAM is 1 GB and their hard disk is an 80 GB SATA at 5400 rpm. Further details of the configuration of the experimental environment are shown in Table 5.2.

Table 5.2: The configuration of the experimental environment.
  Network bandwidth                          1000 Mb/s
  Operating system                           Linux Fedora 12
  Experiment platform                        Hadoop version 0.20.204
  Java execution environment                 JDK 1.6.25
  Integrated Development Environment (IDE)   Eclipse (Linux version)

On the above hardware, the specific steps of building the Hadoop platform are described as follows [189] [196]:

Step 1: The operating system Linux Fedora 12 is installed on the nine desktop computers.
Step 2: As shown in Table 5.3, each computer is named and connected to the network switch, where hadoop1 is the Master node and the others are Slave nodes. The file /etc/hosts of each machine is modified to ensure that the host name of each machine and the corresponding IP address can be resolved correctly. For example, on the NameNode, namely the Master hadoop1, the IP addresses of all the computers in the cluster and the corresponding host names need to be added into the hosts file. On a DataNode, namely on each of the other eight Slave nodes, the local IP address, the IP address of the NameNode and its corresponding host name need to be added into the hosts file.

Table 5.3: The configuration of nodes in a Hadoop cluster.
  Computer name   IP address      Function
  hadoop1         192.168.1.100   Master
  hadoop2         192.168.1.101   Slave
  hadoop3         192.168.1.102   Slave
  hadoop4         192.168.1.103   Slave
  hadoop5         192.168.1.104   Slave
  hadoop6         192.168.1.105   Slave
  hadoop7         192.168.1.106   Slave
  hadoop8         192.168.1.107   Slave
  hadoop9         192.168.1.108   Slave

Step 3: JDK is the Java kernel required to run Hadoop, so JDK 1.6 or a later version has to be installed on Linux before Hadoop is installed on each machine. This experiment chooses and installs jdk-6u25-linux-i586-rpm.bin, namely JDK 1.6.25.

Step 4: On each computer, the Eclipse integrated development environment with
MapReduce plugins is installed, which makes it convenient to develop MapReduce programs, configure the Hadoop cluster files and operate HDFS.

Step 5: Generally, the SSH (Secure Shell) service is installed when installing Linux. In the Hadoop cluster, both the NameNode and the JobTracker use SSH to start and stop the daemon processes of each node, so for convenience of operation, SSH is set to start automatically at boot and passwordless SSH login is configured on each node.

Step 6: Hadoop version 0.20.204 is installed on each machine in the cluster. Following the introduction on the official Hadoop project website, the configuration files of the Hadoop cluster are modified. For instance, the JDK path in hadoop-env.conf is modified and hadoop-site.conf, the main configuration file of Hadoop, is configured.

Step 7: After completing the configuration, all the configuration files are copied to all nodes in the cluster. At this moment, the setup of the Hadoop cluster environment is finished.

Moreover, in the experiment evaluating PLDA, the default Hadoop configuration is used. The default FIFO scheduler serves as the Hadoop scheduler. Each slave node can run up to 2 Map tasks at the same time; accordingly, each node is able to run at most 2 Reduce tasks at the same time.

5.5.1.2 Experimental Data

This experiment uses the real text data set 20 Newsgroups as the experimental data. It is composed of 20 different classes of news and contains 18828 collected samples. The samples are stored in 20 folders respectively, and the total file size is 32.8 MB [164]. At present, 20 Newsgroups has been widely applied to text application
technologies with machine learning, including text classification and text clustering [163]. On this text set, the text preprocessing and the parameter settings for implementing the LDA algorithm are the same as the processing operations in Chapter 3.

5.5.1.3 The Experiment in the Homogeneous Environment

Test 1: To verify the feasibility and effectiveness of implementing the PLDA algorithm in Hadoop.

The text set of 20 Newsgroups is real, but the size of the data set is small. In order to test the computing performance of the PLDA algorithm in Hadoop with different numbers of nodes, a 640 MB data set is constructed by copying. In the whole Hadoop cluster, nodes can be added or removed very easily because of the scalability of Hadoop. 1 to 8 data nodes are used to process the above constructed data set respectively. Figure 5.2 shows the computing time of different numbers of data nodes dealing with the same amount of documents. As shown in Figure 5.2, as the number of data nodes increases, the computing time for processing the same data decreases. Increasing the number of data nodes can significantly improve the computational efficiency of the Hadoop cluster when processing the same size of data. In addition, the NameNode and JobTracker are only responsible for task scheduling and do not participate in the calculation, thus the removed nodes do not include the NameNode and JobTracker. Therefore, the feasibility and effectiveness of the PLDA algorithm in Hadoop are confirmed again.
Figure 5.2: The computing time with different numbers of data nodes in a homogeneous cluster. (Axes: number of data nodes, 1 to 8, against computing time, 0 to 8000 s.)

Test 2: A comparison between a single machine and one data node in the cluster processing the data.

In the same hardware and software configuration environment, the performance of a single machine and of one data node in the Hadoop cluster is compared when they process the same scale of data. In the contrast experiment, the size of the data set increases gradually (1.1 MB, 2.7 MB, 5.3 MB, 16.2 MB, 32.8 MB and 53.7 MB) so that cross-validation is executed successfully.
Table 5.4: The experimental result of a single machine and one data node in the cluster processing the data.
  Experiment   File size (MB)   Number of sample   Overhead of a single machine   Overhead of one data node running
                                documents          running serial LDA (s)         parallel LDA in a Hadoop cluster (s)
  1            1.1              320                14                             132
  2            2.7              780                46                             146
  3            5.3              1600               109                            173
  4            16.2             6400               378                            208
  5            32.8             18828              insufficient memory reported   253
  6            53.7             29500              insufficient memory reported   335

Table 5.4 shows that when the single machine runs the serial LDA algorithm, the computing time increases rapidly with the growth of the input data and the algorithm performance decreases significantly. The reason is that as the input data grows, the consumption of memory and other resources becomes too large on the single machine, so the machine's performance declines sharply, eventually to the point of reporting that the memory is not sufficient to complete the computing tasks. However, the Hadoop cluster can complete these computing tasks, which means that the PLDA algorithm has the ability to deal with large-scale data in the Hadoop cluster. But when the data set is small, the processing efficiency of PLDA in the Hadoop cluster is much lower than that of the single-machine serial LDA. This is because the start-up and scheduling of the job, the communication among the nodes and disk I/O consume a certain amount of time, and their proportion of the total consumption is large; the actual computation time accounts for only a small proportion of the total.
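The break-even behaviour visible in Table 5.4 can be checked with a few lines of arithmetic; the values below are taken directly from the first four rows of the table (the rows where both configurations finish).

```python
# File sizes (MB), serial LDA overhead (s) and parallel PLDA overhead (s)
# for experiments 1-4 of Table 5.4.
sizes    = [1.1, 2.7, 5.3, 16.2]
serial   = [14, 46, 109, 378]
parallel = [132, 146, 173, 208]

# Speedup below 1 means the serial run is faster; the cluster only pays
# off once the data set outgrows the fixed job start-up and I/O costs.
for mb, s, p in zip(sizes, serial, parallel):
    print(f"{mb:5.1f} MB  speedup = {s / p:.2f}")

# Treating the smallest parallel overhead as a rough fixed cost shows how
# little of the small-input runtime is actual computation.
fixed = parallel[0]
marginal = parallel[-1] - fixed   # extra work beyond the fixed cost for 16.2 MB
print(f"approximate fixed overhead: {fixed} s, marginal work: {marginal} s")
```

At 1.1 MB the cluster is roughly nine times slower than the serial run, while at 16.2 MB it is already almost twice as fast, which matches the discussion of fixed scheduling and I/O overheads above.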
In addition, a large number of small data files makes the storage efficiency of HDFS low, and its retrieval speed is slower than with large files. When running MapReduce operations, the data is assigned to each mapper according to blocks in Hadoop, so these small data files consume much computing power. Thus, the PLDA algorithm is designed to process large data sets which a single machine cannot handle. Meanwhile, this also shows that the Hadoop platform is suitable for processing large-scale data.

5.5.1.4 The Experiment in the Heterogeneous Environment

In order to test the performance of the algorithm, the constructed and preprocessed text data of 20 Newsgroups is divided into five different sizes of data sets: 300 MB, 400 MB, 500 MB, 600 MB and 700 MB. Eight data nodes in the heterogeneous and the homogeneous cluster respectively execute contrast tests on the above five sizes of data. Their specific configuration is as shown in Section 5.5.1.1.

Figure 5.3: A comparison of computing time when dealing with different sizes of data with eight data nodes. (Axes: data sizes, 300 to 700 MB, against computing time, 0 to 3000 s; series: the homogeneous cluster and the heterogeneous cluster.)
Figure 5.3 shows the running time of processing different sizes of the text set when the same number of data nodes is used in the homogeneous and the heterogeneous cluster. It can be seen that the total running time of the homogeneous cluster increases basically linearly with the size of the data set. On the other hand, the total running time of the heterogeneous cluster is not stable and fluctuates a little, but the PLDA algorithm can still be executed in the heterogeneous cluster. In addition, the computing efficiency of the homogeneous cluster is better than that of the heterogeneous cluster when processing large data sets. The reason is that the default scheduling approach of the Hadoop platform is built for homogeneous clusters, and when PLDA is realized in Hadoop, the data set is distributed equally according to the number of mappers in the cluster. When the heterogeneous cluster deals with a large data set, especially when the number of nodes and the performance differences among the nodes are large, load imbalance arises. The nodes with strong performance and fast processing speed have to wait for the nodes with poor performance and slow processing speed, degrading the computational efficiency of the whole cluster.

5.5.2 The Experiment on Load Balancing in a Simulated Environment

The extra overhead generated by task scheduling in a Hadoop cluster is inevitable, and its proportion of the total overhead is not small. However, if there are dozens or even hundreds of nodes in the Hadoop cluster, the proportion of the extra scheduling overhead in the total overhead reduces to a very low and acceptable level. In addition, the larger the amount of data and the more data nodes in the Hadoop platform, the more obvious its advantages become. Due to the limited experimental conditions, the number of nodes in the cluster is small and the size of the test data is not large. So, it is difficult to show the advantages of the
Hadoop platform fully, and the load balancing problem is not obvious. Therefore, HSim [203] is used in this section to simulate a large-scale cluster processing massive data, verifying the effectiveness of the static load balancing strategy based on the genetic algorithm.

HSim is a simulator which can simulate a Hadoop environment [202] [203]. Its structure, basic idea and working process are similar to those of Hadoop, and its aim is to simulate the behaviour of the Hadoop framework accurately. Briefly, the Hadoop framework is modelled and simulated by HSim in four aspects: node parameters, cluster parameters, Hadoop system parameters and HSim's own parameters. It supports the simulation of both homogeneous and heterogeneous Hadoop computing environments. Besides, the performance of HSim has been verified against published benchmark results and against published and customized MapReduce applications. The experimental results showed that HSim has high accuracy and stability when simulating Hadoop applications; therefore, HSim is able to simulate a variety of MapReduce jobs. However, HSim is not perfect and has some limitations: in a simulated very large cluster its performance is affected, but for a simulated cluster of up to 100 nodes HSim works well [203]. Thus, in order to evaluate the load balancing algorithm, a cluster with 40 nodes is simulated in this experiment.

A high-performance computer is used to run HSim. Its hardware configuration is as follows: the CPU is an Intel(R) Core(TM) i7-880 at up to 3.73 GHz, the RAM is 16 GB and the hard drive is a 2 TB SATA disk at 7200 rpm. Its software configuration is the same as in Section 5.5.1.1. Table 5.5 shows the configuration of the simulated Hadoop environment when evaluating the performance of the load balancing strategy for PLDA.
Table 5.5: The configuration of the simulated Hadoop environment.
  Number of simulated data nodes        40
  Number of processors in each node     1
  Number of cores in each processor     2
  Number of hard disks in each node     1
  Number of mappers in each node        2
  Number of reducers in each node       2
  Processing speeds of the processors   depending on H(d)
  Reading speed of the hard disk        80 MB/s
  Writing speed of the hard disk        40 MB/s
  Sort factor                           100

In the simulated Hadoop environment, the processing speed of a data node depends on the heterogeneity of the Hadoop cluster. Let d_i denote the processing speed of the i-th data node; the total computing capacity of the cluster can then be expressed as D_t = Σ_{i=1}^{n} d_i, where n stands for the number of data nodes in the cluster. Given the total computing capacity D_t of the Hadoop cluster, the heterogeneity of the Hadoop cluster can be defined as Equation (5.12):

    H(d) = Σ_{i=1}^{n} (d_i − d̄)²                                    (5.12)

where d̄ denotes the average processing speed of all the data nodes in the Hadoop cluster, namely d̄ = (1/n) Σ_{i=1}^{n} d_i = D_t / n.
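Equation (5.12) is easy to compute directly. The sketch below assumes H(d) is the sum of squared deviations of the node speeds from their mean, as reconstructed above; the node speed values are illustrative.

```python
def heterogeneity(speeds):
    """H(d) of Equation (5.12): the sum of squared deviations of the
    data-node processing speeds d_i from their mean d_bar = D_t / n."""
    n = len(speeds)
    d_bar = sum(speeds) / n            # average processing speed
    return sum((d - d_bar) ** 2 for d in speeds)

# A homogeneous cluster has H(d) = 0; spreading the same total capacity
# D_t unevenly over the nodes raises H(d) without changing D_t.
uniform = [1.0, 1.0, 1.0, 1.0]         # D_t = 4, all nodes equal
skewed  = [1.3, 1.1, 0.9, 0.7]         # D_t = 4, two fast and two slow nodes
print(heterogeneity(uniform))          # 0.0
print(heterogeneity(skewed))           # ≈ 0.2
```

Note that H(d) depends only on how unevenly the capacity is spread, not on the total capacity itself, which is why the simulation can vary the heterogeneity from 0 to 2.41 while keeping the cluster size fixed at 40 nodes.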
Test 1: For the same data, the performance comparison under different heterogeneities.

First of all, a data set with a size of 10 GB is tested under different heterogeneities in the simulated Hadoop cluster, namely with values of H(d) from 0 to 2.41. Figure 5.4 shows that when the value of H(d) is less than 1.21, the simulated cluster is close to a homogeneous environment; for the PLDA algorithm, the performance difference between the static load balancing strategy based on the genetic algorithm and the default FIFO scheduling in Hadoop is then not obvious. But as the value of H(d) increases, namely as the heterogeneity of the simulated cluster increases, the proposed strategy reduces the overheads greatly and improves the computational efficiency.

Figure 5.4: The performance comparison of the load balancing strategy under different heterogeneities in a simulated environment. (Axes: heterogeneity H(d), 0 to 2.41, against overhead, 0 to 6000 s; series: without and with load balancing.)
Test 2: For the same heterogeneity, the performance comparison with different sizes of data.

The size of the data set ranges from 10 GB to 100 GB with H(d) set to 2.11. This test evaluates the performance of the proposed strategy when dealing with different sizes of data sets. As shown in Figure 5.5, as the size of the data set increases, the effect of the strategy is not very significant; but compared with the default FIFO scheduling in Hadoop, it still reduces the overheads of the PLDA algorithm. The reason is that the proposed static load balancing strategy is based on a genetic algorithm, which itself needs some computing time when running. Furthermore, as the processed data set grows, the overhead of the load balancing strategy itself also rises.

Figure 5.5: The performance comparison of PLDA with different sizes of data in a simulated environment. (Axes: data sizes, 10 to 100 GB, against overhead, 0 to 6000 s; series: without and with load balancing.)
However, the PLDA algorithm still benefits from the load balancing strategy. For example, when H(d) is 2.11, the overhead of the proposed strategy itself is 673 s and the overhead of one processing wave of mappers with load balancing is 1408 s, while the overhead of one processing wave of mappers without load balancing is 4711 s. So, the algorithm performance improves by 55.8%. Therefore, in a heterogeneous environment, the proposed static load balancing strategy based on the proposed genetic algorithm is effective.

Test 3: The stability of the static load balancing strategy.

In fact, the convergence of the genetic algorithm in the proposed static load balancing strategy also affects the efficiency of the PLDA algorithm in Hadoop. In order to analyze the convergence of the genetic algorithm, the overheads and the evolutionary generations of PLDA processing a data set with a size of 10 GB are tested in the simulated Hadoop environment.

Figure 5.6: The convergence of the genetic algorithm in the load balancing strategy. (Axes: number of generations, 1 to about 856, against overhead of PLDA, 0 to 450 s.)
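The 55.8% figure can be reproduced from the reported overheads; all values below are taken from the text, with the cost of the strategy itself counted against the balanced run.

```python
# Overheads reported for H(d) = 2.11, all in seconds.
strategy_cost = 673      # running the genetic algorithm itself
balanced_wave = 1408     # one processing wave of mappers, with load balancing
unbalanced_wave = 4711   # one processing wave of mappers, without load balancing

total_with = strategy_cost + balanced_wave              # 2081 s in total
improvement = (unbalanced_wave - total_with) / unbalanced_wave
print(f"improvement: {improvement:.1%}")                # improvement: 55.8%
```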
As shown in Figure 5.6, the overhead of PLDA is long in the early evolution and there are significant fluctuations. As the number of evolutionary generations increases, the genetic algorithm is able to keep the potentially good genes. When the number of iterations reaches 400, the algorithm has basically converged and the optimal individuals in the population can be found. At this moment, the PLDA algorithm reaches a stable performance. Thus, the proposed load balancing strategy is effective.

5.6 Summary

Firstly, this chapter introduced and compared the current main parallel LDA methods, such as Mahout's parallelization of LDA, Yahoo's parallel topic model, the parallel LDA algorithm based on Belief Propagation (BP) and Google's PLDA. Then, the PLDA algorithm and its implementation process in Hadoop were described in detail. After studying the current main Hadoop task scheduling algorithms, this chapter brought genetic algorithms into the Hadoop platform and proposed a static load balancing strategy based on a genetic algorithm for PLDA. The biggest feature of this strategy is that, in a heterogeneous computing environment, the data is split according to the divisible load theory before the start of the task, which minimizes the synchronization waiting time among the nodes. Meanwhile, the population searching technology of genetic algorithms is used to make the population evolve to contain or approximate the optimal solution, improving the performance of the Hadoop cluster. In the real homogeneous and heterogeneous Hadoop cluster environments, the experiments showed that when the PLDA algorithm dealt with a huge amount of data in Hadoop, its efficiency was higher than that of the traditional serial algorithm, so it is an effective way to process massive data. In the Hadoop simulation environment, the experiments proved that the proposed static load balancing strategy is effective and stable, and that it can enhance the computational efficiency of the Hadoop cluster.
Chapter 6 Conclusion and Future Works

6.1 Conclusion

In the information era, with the rapid development of computer technologies, web pages, emails, databases, digital libraries and other electronic texts have been increasing fast. Most of this information is in an unstructured or semi-structured form; discovering knowledge from it is very difficult, but also quite significant. How to find the useful and required information quickly and accurately from the many information systems has become a challenge for the field of information science and technology. Probability topic models such as LDA provide an effective tool for text analysis. LDA treats documents as probability distributions over topics, so that a probability model is established for the text generation process and the probability model parameters are trained. Briefly, it can extract the latent topic information from text effectively.

In LDA, the core procedure is to estimate the posterior probability distribution of the topics. The main approximation methods are variational inference based on the EM algorithm and the Markov Chain Monte Carlo method based on Gibbs sampling. The advantage of the former is that variational parameters and iteration are used to approximate the Dirichlet parameters; because its convergence speed is fast, it needs fewer iterations, but the calculation in each iteration is complex and the computing overhead is enormous when dealing with massive text. The latter constructs a Markov chain over the posterior probability; the global statistics are updated by sampling, the number of iterations is fixed, the sampling algorithm in each iteration is simple and the computing overhead per iteration is small, but, owing to the nature of Markov chains, many iterations are required to reach a stable state. These two algorithms have their own advantages and
disadvantages respectively. However, comparatively speaking, the scalability of LDA based on Gibbs sampling is better, and it is easier to realize the parallelization of LDA. Therefore, all the research work in this thesis focused on LDA based on Gibbs sampling.

Text classification is one of the key technologies of text mining. It can deal with the problem of heterogeneous and messy text information to some extent, reducing the search space, speeding up retrieval and improving the query accuracy. So, classifying text information automatically is an effective approach, and it presents both a new challenge and a new development opportunity for the information industry. However, text classification technologies face several problems at present. For example, the feature dimension is too high, the numbers of text categories and samples are too large, there is a lot of noise, and the number of samples in each class is unbalanced. Besides, the representation of text features and the choice of text classification algorithm greatly impact the entire text classification system.

Aiming at these problems in traditional text classification systems, the probability topic models, especially the LDA topic model, were studied and their advantages and disadvantages compared in Chapter 2 of this thesis. LDA is a total probability generative model, so it has a clear internal structure and can take advantage of effective probability inference algorithms for computation. The size of its parameter space is independent of the number of training documents, so LDA is well suited to handling large-scale corpora. In addition, it is a hierarchical model, and it is more stable than other topic models. Besides, the SVM classification algorithm has uniquely excellent performance for text classification. In Chapter 3, the LDA model, with its good performance in text representation and its dimension reduction effect, was combined with the SVM algorithm and its strong classification ability, and a text classification method based on the LDA model was proposed. In the framework of SVM, the LDA model was used to model the text set.
Then, Gibbs sampling in Markov Chain Monte Carlo (MCMC) was used to infer and compute the model parameters indirectly, obtaining the probability distribution of the texts over the topic set. Finally, an SVM was trained on the latent topic-document matrix, constructing a text classifier. The classification experiment on 20 Newsgroups, which is a real text set, verified the validity and superiority of the proposed classification method based on the LDA model. The experimental results showed that it effectively overcame the damage to classification performance caused by traditional feature selection methods, and that it can improve the classification accuracy while reducing the data dimension. With LDA as the method of feature selection, compared with SVM alone, the feature vectors obtained a significant dimension reduction, saving training time and enhancing classification efficiency. In addition, compared with the combination of LSA and SVM, the proposed method had better classification performance on text sets with many classes.

When LDA is used to model a whole text set, the number of topics T greatly influences the performance of LDA in modelling the text set. In other words, the given number of topics and the initial value of the topic distribution play an important role in the parameter learning process of LDA. In Chapter 4, the current main methods for selecting the optimal number of topics for the LDA model were introduced, such as the selection method based on HDP and the standard method in Bayesian statistics. The former uses the nonparametric feature of the Dirichlet Process (DP) to solve the problem of selecting the optimal number of topics in LDA, but it needs to build an HDP model and an LDA model respectively for the same data set; obviously, this takes a lot of time and computation when dealing with large-scale text sets or real text classification problems. The latter needs to specify a series of values of T manually, where T is the number of topics in LDA. After executing Gibbs sampling, the values of log p(w|T)
can be obtained in sequence so that the value of T in LDA can be determined. In Chapter 4, the relation between the optimal model and the topic similarity was proved theoretically: the LDA model is optimal when the average similarity among the topics is smallest. Based on this constraint condition, the selection of the value of T and the parameter estimation of the LDA model were integrated in one framework. Then, following the basic idea of computing the sample density in DBSCAN, a selection algorithm for the optimal number of topics based on DBSCAN was proposed. By calculating the density of each topic, the most unstable topic model structure can be found; the procedure iterates until the model is stable, and finally the optimal number of topics in LDA for the text set is found. The experimental results showed that the proposed DBSCAN-based selection method for the optimal number of topics is effective. Compared with the two current main methods, it can determine the value of T automatically with relatively few iterations. In particular, it does not need the number of topics to be set manually in advance.

In the real applications of text classification, the training data inevitably contains noise, and the noise samples influence the final classification result significantly. At present, there are two main LDA-based noise processing methods: the data smoothing method based on the LDA model and the noise identification method based on the category entropy. Both of them can effectively remove some noise samples to some extent, but they cannot completely delete all the noise data from the data set. In addition, it is inevitable that some normal samples are removed as noise by mistake. Therefore, the data smoothing method based on the LDA model and the noise identification method based on the category entropy were combined in Chapter 4, and a noise data reduction scheme was put forward. Its basic idea was to bring the concept of information entropy into text classification, with the noise samples filtered according to the category entropy.
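As an illustration of the entropy-based filter just mentioned, the sketch below scores each training sample by the information entropy of its distribution of membership probabilities over the categories; high-entropy samples, whose category is ambiguous, are treated as noise candidates. The threshold value and the probability source are illustrative assumptions, not the thesis's exact settings.

```python
import math

def category_entropy(probs):
    """Information entropy H = -sum(p * log2 p) of a sample's
    probability distribution over the categories."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def filter_noise(samples, threshold=1.0):
    """Keep samples whose category entropy is below the threshold;
    samples is a list of (doc_id, category_probs) pairs.  The
    threshold value here is an illustrative assumption."""
    return [doc for doc, probs in samples if category_entropy(probs) < threshold]

# A confidently categorized sample has low entropy; an ambiguous one has
# high entropy and is filtered out as a noise candidate.
samples = [("doc1", [0.9, 0.05, 0.05]),   # clearly one category: kept
           ("doc2", [0.4, 0.3, 0.3])]     # spread over categories: filtered
print(filter_noise(samples))              # ['doc1']
```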
Then, the generation process of the LDA
topic model was used to smooth the data, weakening the influence of the noise samples remaining after filtering. Meanwhile, it can keep the original size of the training set. The experiments with the real data set showed that even when the noise ratio in the training data set was large, the classification accuracy of the proposed strategy was still good. In other words, the classification accuracy was only weakly affected by the noise ratio, and the performance was stable.

Because the storage and computing capacities of a single machine are limited, traditional single-machine LDA cannot meet the needs of real large-scale data sets. So, the parallelization of LDA has become one of the main methods, and a research priority, for processing massive text data. In Chapter 5, the current main methods of parallelizing LDA were introduced, such as Mahout's parallelization of LDA, Yahoo's parallel topic model, the parallel LDA algorithm based on Belief Propagation (BP) and Google's PLDA. Based on the three main parallel programming technologies, namely OpenMP, MPI and MapReduce, the three inference methods for LDA model parameters, namely Variational Bayes (VB), Gibbs Sampling (GS) and Belief Propagation (BP), realize their parallel computing. After comparison and analysis, the PLDA algorithm and the Hadoop platform were chosen and studied in this thesis.

Then, the three current main Hadoop task scheduling algorithms were analyzed. FIFO is simple, but its limitations are large. Fair scheduling brings some improvement, but it cannot support large memory operations. Capacity scheduling is able to optimize the resource allocation effectively and enhance the operating efficiency; however, it requires many parameter values to be configured manually, which is quite difficult for normal users who do not understand the details of the Hadoop cluster. In addition, all of them are suited to a homogeneous environment; in a heterogeneous environment they waste a lot of resources and incur large overheads, affecting the overall performance of the Hadoop platform.
Therefore, this thesis brought a genetic algorithm, which is an intelligent optimization
algorithm, into the Hadoop platform [204] [205] [206]. Considering that different nodes in the cluster have different computing speeds, communication capabilities and storage capacities, a static load balancing strategy based on a genetic algorithm for PLDA was proposed. Its basic idea is to assume that the total time T of a mapper processing a data block contains four parts, i.e., the data copying time t_c, the data processing time t_p, the merging time of intermediate data t_m and the buffer releasing time t_b. In a heterogeneous Hadoop cluster, T_all, the total time of processing the data blocks in one processing wave, depends on the maximum time consumed by the k participating mappers, namely T_all = max(T_1, T_2, ..., T_k). According to the optimal principle of the divisible load theory, each data node should in theory complete local Gibbs sampling at the same time, namely T_1 = T_2 = ... = T_k. Based on a genetic algorithm, this optimal principle was designed and accomplished in Chapter 5. It is mainly embodied in the determination of the fitness function of the genetic algorithm: the mean square error is used to measure the difference among the total times T of the mappers processing the data. As soon as the value of the fitness is smaller than a given fitness threshold, which means that the k mappers complete local data processing almost at the same time, the iteration of the genetic algorithm converges and ends. Thus, an optimal or near-optimal solution can be found to determine the size of the data block for each mapper, minimizing the completion time of parallel LDA. In fact, partitioning the data before the start of the task and minimizing the synchronization waiting time among the data nodes as far as possible can improve the performance of the Hadoop cluster in a heterogeneous environment.

Finally, the experiments were set up in a real Hadoop cluster environment, which had one name node and eight data nodes. In the homogeneous computing environment and the heterogeneous computing environment, the feasibility and validity of Hadoop
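The fitness computation just described can be sketched as follows. The per-mapper time model (how t_c, t_p, t_m and t_b scale with the block size) is a hypothetical stand-in for illustration; what the sketch does reflect from the text is that the fitness is the mean square error of the mapper times T_i, so that a small fitness means the k mappers finish local Gibbs sampling almost simultaneously.

```python
def mapper_time(block_mb, speed, t_rates=(0.1, 0.05, 0.02)):
    """Total time T = t_c + t_p + t_m + t_b for one mapper.  Assumes
    (hypothetically) that copy, merge and buffer-release costs scale
    with the block size and that t_p is block size over node speed."""
    t_c, t_m, t_b = (r * block_mb / 100 for r in t_rates)
    t_p = block_mb / speed
    return t_c + t_p + t_m + t_b

def fitness(block_sizes, speeds):
    """Mean square error of the mapper times: small values mean the
    k mappers complete local data processing almost at the same time."""
    times = [mapper_time(b, s) for b, s in zip(block_sizes, speeds)]
    mean = sum(times) / len(times)
    return sum((t - mean) ** 2 for t in times) / len(times)

speeds = [2.0, 1.0, 1.0]                  # one fast node, two slow nodes
equal_split = [100, 100, 100]             # default: equal block sizes
tuned_split = [150, 75, 75]               # more data to the faster node
assert fitness(tuned_split, speeds) < fitness(equal_split, speeds)
```

Splitting the data in proportion to node speed drives the mapper times toward T_1 = T_2 = ... = T_k, which is exactly the state the genetic algorithm's fitness threshold accepts.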
implementing the PLDA algorithm was proved respectively. In particular, via the comparison of a single machine and one data node in the cluster processing the data, the experiments proved that the PLDA algorithm is designed to deal with large data sets which a single machine cannot handle. Due to the limited experimental conditions, HSim, a simulator, was used to simulate a Hadoop cluster with 40 nodes processing 10 GB to 100 GB of data. The validity of the proposed static load balancing strategy based on the genetic algorithm was verified in three aspects, namely different heterogeneities, different sizes of data and stability, where the heterogeneity of the heterogeneous Hadoop cluster was also determined and achieved through the mean square error. For example, for the same data but different heterogeneities, the PLDA with the proposed load balancing strategy is obviously faster than the PLDA without it when the heterogeneity is bigger than 1.21. And, for the same heterogeneity but different sizes of data, the PLDA with the proposed load balancing strategy is still always faster than the PLDA without it. In the simulated environment, the smaller the value of the mean square error, the smaller the heterogeneity of the Hadoop cluster; conversely, the bigger the value of the mean square error, the bigger the heterogeneity.

6.2 Future Works

The research work of this thesis mainly focused on the LDA probability topic model and the Gibbs sampling algorithm, which is one of the posterior inference methods. To improve the accuracy, on the one hand, the LDA model and SVM were combined to construct a new text classifier; on the other hand, the optimal number of topics in LDA can be found automatically and the noise data can be removed from the text set effectively. In terms of enhancing the processing capacity and speed, the Hadoop platform was used to realize the parallelization of LDA, making the LDA model deal with real large-scale text data effectively. This thesis has achieved the above research results, but some problems and limitations remain.
Therefore, the following research work will need to be done in the future.

6.2.1 The Supervised LDA

More and more data with user labels appears on the Internet, so modelling labelled data has become a practical need [207]. The traditional LDA model is an unsupervised learning model: it only models words, without considering any label information. Supervised LDA algorithms provide an effective modelling tool for labelled data by bringing the text label information into the traditional LDA model [208]. They are able to calculate the probability distribution of latent topics in each class, overcoming the defect that the traditional LDA model has to allocate latent topics within a single class forcibly [209]. Besides, they can improve the representation capability of the LDA model [210]. Thus, studying and applying supervised LDA will be one of the future works.

At present, the representative supervised LDA algorithms include the author-topic model [211] [212] [213] and Labeled-LDA [214] [215]. In the author-topic model [211] [212], the relationship between documents and authors is modelled through the topics, which can be applied to searching for authors with similar topics and other tasks. The model brings variables representing author information into the traditional LDA. Multiple labels, namely the author information, are assigned to each document, and each document's author set is drawn from the set of author labels. In this model, the authors in the author set determine the document: for each position in the document, an author is selected from the author set randomly; the selected author then chooses a topic z; next, according to the probability distribution of the topics over words, a word is selected. Finally, Gibbs sampling is used to estimate the model parameters [213].

There are a large number of documents with user labels on the Internet. Labeled-LDA
[214] can model both the words in labelled documents and the generation process of the labels. According to the model parameters, the label distribution of new documents is estimated. It brings in the category information, adding one more tier than the standard LDA model, which can be called the text category tier. The algorithm is mainly used to determine the specific labels assigned to each word in a multi-label document; then, the distribution of each label over the whole document is estimated. Because the label of each document is observable, the parameters of the label distribution are independent of the other parameters in the model [215]. Therefore, the Gibbs sampling of the traditional LDA can be used to estimate the model parameters.

6.2.2 The Improved PLDA: PLDA+

The PLDA algorithm has two limitations. Firstly, the word-topic matrix is not partitioned, so the whole matrix has to be stored in a data node's memory. Secondly, the overhead of data exchange in the Reduce phase is high when dealing with large-scale data. To overcome these limitations, [216] put forward an enhanced version of the PLDA algorithm, called PLDA+, which can not only partition the document-topic matrix but also split the word-topic matrix.

In brief, the nodes in the cluster are divided into two groups in the PLDA+ algorithm [216]. The task of one group of nodes is to maintain the split document-topic matrix and carry out Gibbs sampling, assigning topics to the words in the documents. The task of the other group of nodes is to maintain the split word-topic matrix and to accept and accumulate the updates of the word-topic matrix from the first group. Thus, the PLDA+ algorithm can halve the data exchange bottleneck, reducing the communication time among the nodes and enhancing the speed and performance of the algorithm. In addition, assigning these two kinds of tasks to two groups of nodes makes it possible to deal with the problem of load
balancing more flexibly [217].

6.2.3 Dynamic Load Balancing Problem

A load balancing strategy is a task scheduling strategy [218] [219], and is a key issue in obtaining high performance and improving the computing scalability and application scale in a distributed computing environment. The research work of this thesis focused on a static load balancing strategy in a Hadoop cluster, which assumed that the load can be determined and split before running. But in a real large-scale cluster, such as a cloud computing platform, if the load can only be measured at runtime and split dynamically, or if the difference in operational capability among the nodes or the load fluctuation is large, a Dynamic Load Balancing (DLB) strategy should be considered [220]. Research showed that, compared with a static load balancing strategy, a DLB strategy can usually obtain a 30% to 40% performance improvement [221]. Currently, according to where the control resides, the existing DLB strategies can be divided into three categories: the distributed DLB strategy, the centralized DLB strategy and the hybrid hierarchical DLB strategy [220].

The static load balancing strategy in this thesis was based on genetic algorithms, so turning it into a DLB strategy for dynamic load environments will be one of the future works. For example, the selection of the crossover probability P_c and the mutation probability P_m in genetic algorithms directly affects the convergence and performance of the algorithm. Traditionally, P_c and P_m are static, manually chosen settings. Srinivas et al. proposed a method of dynamic parameter setting, where P_c and P_m change automatically along with the fitness, adapting to the dynamic computing environment and decreasing the limitation of manually selected
parameters [222]. If P_c and P_m increase, the variation among individuals increases, but the fitness values of the individuals decline; this helps guarantee global solutions, but is not conducive to making the population converge to the optimal solution. If P_c and P_m decrease, the fitness values of the individuals increase, but the variation among individuals declines; this helps the population converge to the optimal solution, but cannot ensure that it is the global optimum [223] [224]. Therefore, it is necessary to set P_c and P_m dynamically.

6.2.4 The Application of Cloud Computing Platforms

Due to the limited experimental conditions, the number of nodes in the cluster was not large, and the computing and storage capacities were not strong. Thus, the HSim simulator was used to implement the proposed static load balancing strategy in this thesis. In the simulation environment the strategy shows good performance and stability, but processing real massive data will be a challenge. At present, there are many large cloud computing platforms, such as Google's Google App Engine, Amazon's Amazon EC2 (Amazon Elastic Compute Cloud) and Amazon S3 (Amazon Simple Storage Service), IBM's Blue Cloud and Microsoft's Windows Azure [225] [226] [227]. Therefore, applying the proposed load balancing strategy to these large cloud computing platforms and obtaining good performance will become one of the future works.

6.2.5 The Application in Interdisciplinary Fields

The LDA probabilistic topic model can be applied not only to the field of text analysis, but also to images, audio, video and other structured information analysis. For example, it can be applied to image annotation and image classification. The key is to define the words and the documents reasonably. Usually, a pixel block in the
image is treated as the word, a single image is treated as the document, and the whole image library is regarded as the document set [228] [229]. At the same time, the features of the pixel blocks are used to carry out the coding, generating the word list of the image library. Then, the LDA model can be used to learn the image library [230]. Compared to text, images have their own characteristics, such as the high dimensionality of image elements, strong relevance and low dispersion. Therefore, the establishment of topic models and the selection of model parameters for images are worthy of further study, and will also be an emphasis of future research.

From the research and analysis of the LDA model and its parallel algorithms in this thesis, it can be seen that LDA is a fast and effective text analysis tool. In addition, derivative algorithms based on the LDA topic model are an important issue in the field of machine learning. At present, when the LDA model and its related algorithms deal with real massive text data, much further work is required, including the study of the theoretical foundations and the analysis of the application scenarios. On the whole, it is a field of scientific research with practical value, which is worthy of intensive study.
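The visual-word representation discussed in Section 6.2.5 can be sketched as follows. This is a hedged illustration only: the tiny hand-made codebook is hypothetical, whereas a real pipeline would learn it from the block features of the whole image library (e.g. with k-means), before running standard LDA on the resulting documents.

```python
# Sketch: turning a grey-level image into a "document" of visual words for LDA.
# The codebook of block prototypes is a hypothetical toy; in practice it
# would be learned over all block features of the image library.

def blocks(image, size):
    """Split a 2-D grey-level image (list of rows) into size x size blocks."""
    h, w = len(image), len(image[0])
    for y in range(0, h - size + 1, size):
        for x in range(0, w - size + 1, size):
            yield [image[y + dy][x + dx] for dy in range(size) for dx in range(size)]

def nearest_word(block, codebook):
    """Index of the codebook prototype closest to the block (the 'word' id)."""
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(range(len(codebook)), key=lambda i: dist(block, codebook[i]))

def image_to_document(image, codebook, size=2):
    """The image becomes a bag of visual words, ready for a standard LDA run."""
    return [nearest_word(b, codebook) for b in blocks(image, size)]
```

With a two-entry codebook of all-dark and all-bright prototypes, a half-dark, half-bright image maps to an alternating sequence of the two word ids, one per block.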
References
[1] Liu, B. (2011) Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. 2nd edn. New York: Springer Publishing.
[2] Han, J.W., Kamber, M. and Pei, J. (2006) Data Mining: Concepts and Techniques. 2nd edn. Burlington: Morgan Kaufmann.
[3] Tan, P.N., Steinbach, M. and Kumar, V. (2005) Introduction to Data Mining. Boston: Addison-Wesley.
[4] Kantardzic, M. (2011) Data Mining: Concepts, Models, Methods, and Algorithms. 2nd edn. Hoboken: Wiley-IEEE Press.
[5] Feldman, R. and Sanger, J. (2006) The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge: Cambridge University Press.
[6] Berry, M.B. and Kogan, J. (2010) Text Mining: Applications and Theory. 1st edn. Colorado: Wiley.
[7] Song, M. and Wu, Y.F.B. (2008) Handbook of Research on Text and Web Mining Technologies. Hershey: Information Science Reference.
[8] Francis, L. (2006) Taming Text: An Introduction to Text Mining. Casualty Actuarial Society Forum, pp. 51-88.
[9] Karanikas, H. and Theodoulidis, B. (n.d.) Knowledge Discovery in Text and Text Mining Software. Centre for Research in Information Management, Department of Computation, UMIST, Manchester, UK [Online]. Available at: http://www.crim.co.umist.ac.uk [Accessed 26 October 2011].
[10] Lent, B., Agrawal, R. and Srikant, R. (1997) Discovering Trends in Text Databases. IBM Almaden Research Center, San Jose, California, USA [Online]. Available at: http://www.almaden.ibm.com/cs/projects/iis/hdb/publications/papers/kdd97_trends.pdf [Accessed 26 October 2011].
[11] Miner, G., Elder, J., Hill, T., Nisbet, R., Delen, D. and Fast, A. (2012) Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications. Waltham: Academic Press.
[12] Zanasi, A. (2007) Text Mining and its Applications to Intelligence, CRM and Knowledge Management. Ashurst: WIT Press.
[13] Srivastava, A. and Sahami, M. (2009) Text Mining: Classification, Clustering, and Applications. 1st edn. USA: Chapman and Hall/CRC.
[14] Berry, M.W. (2003) Survey of Text Mining: Clustering, Classification, and Retrieval. 2nd edn. New York: Springer Publishing.
[15] Franke, J., Nakhaeizadeh, G. and Renz, I. (2003) Text Mining: Theoretical Aspects and Applications. 1st edn. Berlin: Physica-Verlag HD.
[16] Chemudugunta, C. (2010) Text Mining with Probabilistic Topic Models: Applications in Information Retrieval and Concept Modeling. Saarbrücken: LAP LAMBERT Academic Publishing.
[17] Blei, D.M. (2004) Probabilistic models of text and images. California: University of California.
[18] Blei, D.M. (2012) Probabilistic topic models. Communications of the ACM, 55 (4), pp. 77-84.
[19] Blei, D.M. and Lafferty, J.D. (2009) Topic models. Text Mining: Theory and Applications, pp. 71-93.
[20] Deerwester, S., Dumais, S.T., Furnas, G.W. and Landauer, T.K. (1990) Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41 (6), pp. 391-407.
[21] Torkkola, K. (2002) Discriminative features for document classification. 16th International Conference on Pattern Recognition, 1, pp. 472-475.
[22] Foltz, P.W. (1990) Using latent semantic indexing for information filtering. COCS '90 Proceedings of the ACM SIGOIS and IEEE CS TC-OA conference on Office information systems, pp. 40-47.
[23] McCarey, F., Cinnéide, M. and Kushmerick, N. (2006) Recommending library methods: an evaluation of the vector space model (VSM) and latent semantic indexing (LSI). ICSR'06 Proceedings of the 9th international conference on Reuse of Off-the-Shelf Components, pp. 217-230.
[24] Chen, L., Tokuda, N. and Nagai, A. (2003) A new differential LSI space-based probabilistic document classifier. Information Processing Letters, 88 (5), pp. 203-212.
[25] Hofmann, T. (1999) Probabilistic latent semantic indexing. SIGIR '99 Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pp. 50-57.
[26] Hofmann, T. (1999) Probabilistic latent semantic analysis. UAI'99 Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence, pp. 289-296.
[27] Brants, T. (2005) Test Data Likelihood for PLSA Models. Information Retrieval, 8 (2), pp. 181-196.
[28] Buntine, W. (2009) Estimating Likelihoods for Topic Models. ACML '09 Proceedings of the 1st Asian Conference on Machine Learning: Advances in Machine Learning, pp. 51-64.
[29] Lu, Y., Mei, Q.Z. and Zhai, C.X. (2011) Investigating task performance of probabilistic topic models: an empirical study of PLSA and LDA. Information Retrieval, 14 (2), pp. 178-203.
[30] Blei, D.M., Ng, A.Y. and Jordan, M.I. (2003) Latent Dirichlet allocation. The Journal of Machine Learning Research, 3, pp. 993-1022.
[31] Minka, T. and Lafferty, J. (2002) Expectation-propagation for the generative aspect model. UAI'02 Proceedings of the Eighteenth conference on Uncertainty in artificial intelligence, pp. 352-359.
[32] Blei, D.M. and Jordan, M.I. (2006) Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1 (1), pp. 121-144.
[33] Kao, A. and Poteet, S.R. (2010) Natural Language Processing and Text Mining. 1st edn. New York: Springer Publishing.
[34] Bellegarda, J. (2008) Latent Semantic Mapping: Principles and Applications. San Rafael: Morgan and Claypool Publishers.
[35] Zhong, N. and Liu, J.M. (2004) Intelligent Technologies for Information Analysis. 2nd edn. New York: Springer Publishing.
[36] Vinokourov, A. and Girolami, M. (2002) A probabilistic framework for the hierarchic organisation and classification of document collections. Journal of Intelligent Information Systems, 18 (2/3), pp. 153-172.
[37] Lucas, P., Gámez, J.A. and Cerdan, A.S. (2007) Advances in Probabilistic Graphical Models. 1st edn. New York: Springer Publishing.
[38] Chemudugunta, C., Smyth, P. and Steyvers, M. (2007) Modeling general and specific aspects of documents with a probabilistic topic model. In Proceedings of the Conference on Advances in Neural Information Processing Systems (NIPS 07), pp. 241-248.
[39] Girolami, M. and Kabán, A. (2003) On an equivalence between PLSI and LDA. SIGIR '03 Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 433-434.
[40] Masada, T., Kiyasu, S. and Miyahara, S. (2008) Comparing LDA with pLSI as a dimensionality reduction method in document clustering. LKR'08 Proceedings of the 3rd international conference on Large-scale knowledge resources: construction and application, pp. 13-26.
[41] Dongarra, J., Foster, I., Fox, G.C., Gropp, W., Kennedy, K., Torczon, L. and White, A. (2002) The Sourcebook of Parallel Computing. 1st edn. Burlington: Morgan Kaufmann.
[42] Grama, A., Karypis, G., Kumar, V. and Gupta, A. (2003) Introduction to Parallel Computing. 2nd edn. Boston: Addison-Wesley.
[43] Hwang, K., Dongarra, J. and Fox, G.C. (2011) Distributed and Cloud Computing: From Parallel Processing to the Internet of Things. Burlington: Morgan Kaufmann.
[44] Amati, G., D'Aloisi, D., Giannini, V. and Ubaldini, F. (1997) A framework for filtering news and managing distributed data. Journal of Universal Computer Science, 3 (8), pp. 1007-1021.
[45] Pacheco, P. (2011) An Introduction to Parallel Programming. Burlington: Morgan Kaufmann.
[46] Gebali, F. (2011) Algorithms and Parallel Computing. Hoboken: Wiley-Blackwell.
[47] Fountain, T.J. (2006) Parallel Computing: Principles and Practice. Cambridge: Cambridge University Press.
[48] Bischof, C., Bucker, M., Gibbon, P., Joubert, G. and Lippert, T. (2008) Parallel Computing: Architectures, Algorithms and Applications. Amsterdam: IOS Press.
[49] Chandra, R., Menon, R., Dagum, L., Kohr, D., Maydan, D. and McDonald, J. (2000) Parallel Programming in OpenMP. Burlington: Morgan Kaufmann.
[50] Tora, S. and Eguchi, K. (2013) MPI/OpenMP hybrid parallel inference for Latent
Dirichlet Allocation. IEICE Transactions on Information and Systems, 96 (5), pp. 1006-1015.
[51] Dean, J. and Ghemawat, S. (2004) MapReduce: Simplified data processing on large clusters. In Proceedings of the ACM USENIX Symposium on Operating Systems Design and Implementation (OSDI 04), pp. 137-150.
[52] Miner, D. and Shook, A. (2012) MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems. 1st edn. Sebastopol: O'Reilly Media.
[53] Janert, P.K. (2010) Data Analysis with Open Source Tools. 1st edn. Sebastopol: O'Reilly Media.
[54] Dumbill, E. (2012) Planning for Big Data. 1st edn. Sebastopol: O'Reilly Media.
[55] Wang, L.Z. et al. (2008) Scientific Cloud Computing: Early Definition and Experience. 10th IEEE International Conference on High Performance Computing and Communications (HPCC), pp. 825-830.
[56] Armbrust, M. et al. (2010) A view of cloud computing. Communications of the ACM, 53 (4), pp. 50-58.
[57] Zhang, L.J. and Zhou, Q. (2009) CCOA: Cloud Computing Open Architecture. IEEE International Conference on Web Services (ICWS), pp. 607-616.
[58] Zhang, S., Zhang, S.Z., Chen, X.B. and Huo, X.Z. (2010) Cloud Computing Research and Development Trend. Second International Conference on Future Networks (ICFN), pp. 93-97.
[59] Gong, C.Y. et al. (2010) The Characteristics of Cloud Computing. 2010 39th International Conference on Parallel Processing Workshops (ICPPW), pp. 275-279.
[60] White, T. (2010) Hadoop: The Definitive Guide. 2nd edn. Sunnyvale: Yahoo Press.
[61] Perera, S. and Gunarathne, T. (2013) Hadoop MapReduce Cookbook. Birmingham: Packt Publishing.
[62] Lam, C. (2010) Hadoop in Action. 1st edn. Greenwich: Manning Publications.
[63] Kala Karun, A. and Chitharanjan, K. (2013) A review on Hadoop HDFS infrastructure extensions. 2013 IEEE Conference on Information & Communication
Technologies (ICT), pp. 132-137.
[64] Taylor, R.C. (2010) An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. Proceedings of the 11th Annual Bioinformatics Open Source Conference (BOSC), 11 (12), pp. 1-6.
[65] Lin, X.L., Meng, Z.D., Xu, C. and Wang, M. (2012) A Practical Performance Model for Hadoop MapReduce. 2012 IEEE International Conference on Cluster Computing Workshops, pp. 231-239.
[66] Chu, C.T. et al. (2006) Map-Reduce for Machine Learning on Multicore. Advances in NIPS, 19, pp. 281-288.
[67] Brown, R.A. (2009) Hadoop at home: large-scale computing at a small college. Proceedings of the 40th ACM technical symposium on Computer science education, pp. 106-110.
[68] Duan, S.Q., Wu, B., Wang, B. and Yang, J. (2011) Design and implementation of a parallel statistical algorithm based on Hadoop's MapReduce model. 2011 IEEE International Conference on Cloud Computing and Intelligence Systems (CCIS), pp. 134-138.
[69] Gomes, R., Welling, M. and Perona, P. (2008) Memory bounded inference in topic models. In Proceedings of the International Conference on Machine Learning (ICML 08), pp. 344-351.
[70] Xie, J. et al. (2010) Improving MapReduce performance through data placement in heterogeneous Hadoop clusters. 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), pp. 1-9.
[71] Porteous, I. et al. (2008) Fast collapsed Gibbs sampling for latent Dirichlet allocation. In Proceedings of the International SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 08), pp. 569-577.
[72] Newman, D., Asuncion, A., Smyth, P. and Welling, M. (2009) Distributed Algorithms for Topic Models. The Journal of Machine Learning Research, 10, pp. 1801-1828.
[73] Tang, N. (2011) Information Extraction from the Internet. 1st edn. CreateSpace Independent Publishing Platform.
[74] Indurkhya, N. and Damerau, F.J. (2010) Handbook of Natural Language Processing. 2nd edn. USA: Chapman and Hall/CRC.
[75] Yang, Y., Slattery, S. and Ghani, R. (2002) A study of approaches to hypertext categorization. Journal of Intelligent Information Systems, 18 (2/3), pp. 219-241.
[76] Chen, W.L. et al. (2004) Automatic word clustering for text categorization using global information. AIRS'04 Proceedings of the 2004 international conference on Asian Information Retrieval Technology, pp. 1-11.
[77] Guyon, I. and Elisseeff, A. (2003) An introduction to variable and feature selection. The Journal of Machine Learning Research, 3, pp. 1157-1182.
[78] Yang, Y. and Pedersen, J.O. (1997) A Comparative Study on Feature Selection in Text Categorization. Proc. 14th Int'l Conf. Machine Learning, pp. 412-420.
[79] Forman, G. (2003) An extensive empirical study of feature selection metrics for text classification. The Journal of Machine Learning Research, 3, pp. 1289-1305.
[80] Tenenbaum, J.B., Silva, V. and Langford, J.C. (2000) A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science, 290, pp. 2319-2323.
[81] Yuan, H.Y., Gao, X.J. and Lei, F. (2010) The Feature Extraction and Dimension Reduction Research Based on Weighted FCM Clustering Algorithm. 2010 International Conference on Electronics and Information Engineering (ICEIE), 2, pp. 114-117.
[82] Griffiths, T.L. and Steyvers, M. (2004) Finding scientific topics. The National Academy of Sciences of the USA, 101 (1), pp. 5228-5235.
[83] Alon, N. and Spencer, J.H. (2008) The Probabilistic Method. 3rd edn. Hoboken: Wiley-Blackwell.
[84] Schutze, H., Hull, D. and Pedersen, J. (1995) A comparison of classifiers and document representations for the routing problem. In Proceedings of SIGIR-95, 18th International Conference on Research and Development in Information Retrieval, pp. 229-237.
[85] Lafferty, J.D. and Blei, D.M. (2005) Correlated topic models. Advances in neural information processing systems, pp. 147-154.
[86] Teh, Y.W., Jordan, M.I., Beal, M.J. and Blei, D.M. (2007) Hierarchical Dirichlet Processes. Journal of the American Statistical Association, 101 (476), pp. 1566-1581.
[87] Zhou, J.Y., Wang, F.Y. and Zeng, D.J.
(2011) Hierarchical Dirichlet Processes and Their Applications: A Survey. Acta Automatica Sinica, 37 (4), pp. 389-407.
[88] Teh, Y.W., Jordan, M.I., Beal, M.J. and Blei, D.M. (2005) Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes. Advances in Neural Information Processing Systems, pp. 1385-1392.
[89] Zhu, X.Q., Wu, X.D. and Chen, Q.J. (2003) Eliminating class noise in large datasets. In Proceedings of the Twentieth International Conference on Machine Learning, pp. 920-927.
[90] Gamberger, D., Lavrac, N. and Dzeroski, S. (2000) Noise detection and elimination in data preprocessing: Experiments in medical domains. Applied Artificial Intelligence: An International Journal, 14 (2), pp. 205-223.
[91] Ramaswamy, S., Rastogi, R. and Shim, K. (2000) Efficient algorithms for mining outliers from large data sets. SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data, 29 (2), pp. 427-438.
[92] Liu, D.X., Xu, W.R. and Hu, J.N. (2009) A feature-enhanced smoothing method for LDA model applied to text classification. International Conference on Natural Language Processing and Knowledge Engineering, pp. 1-7.
[93] Zhou, D. and Wade, V. (2009) Smoothing methods and cross-language document re-ranking. CLEF'09 Proceedings of the 10th cross-language evaluation forum conference on Multilingual information access evaluation: text retrieval experiments, pp. 62-69.
[94] Garcia, M., Hidalgo, H. and Chavez, E. (2006) Contextual Entropy and Text Categorization. Fourth Latin American Web Congress, pp. 147-153.
[95] Xu, L. (2003) Data smoothing regularization, multi-sets-learning, and problem solving strategies. Neural Networks - 2003 Special issue: Advances in neural networks research, 16 (5-6), pp. 817-825.
[96] Jia, R., Tao, T. and Ying, Y. (2012) Bayesian Networks Parameter Learning Based on Noise Data Smoothing in Missing Information. ISCID '12 Proceedings of the 2012 Fifth International Symposium on Computational Intelligence and Design, 1, pp. 136-139.
[97] Zhang, X.W., Parisi-Presicce, F., Sandhu, R. and Park, J. (2005) Formal model and policy specification of usage control. ACM Transactions on Information and System Security (TISSEC), 8 (4), pp. 351-387.
[98] Arora, S.
(2012) Learning Topic Models - Going beyond SVD. 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science (FOCS), pp. 1-10.
[99] Sebastiani, F. (2002) Machine learning in automated text categorization. ACM Computing Surveys, 34 (1), pp. 1-47.
[100] Bishop, C.M. (2006) Pattern Recognition and Machine Learning. New York: Springer Publishing.
[101] Ha-Thuc, V. and Srinivasan, P. (2008) Topic models and a revisit of text-related applications. PIKM '08 Proceedings of the 2nd PhD workshop on Information and knowledge management, pp. 25-32.
[102] Lertnattee, V. and Theeramunkong, T. (2004) Effect of term distributions on centroid-based text categorization. Informatics and computer science intelligent systems, 158 (1), pp. 89-115.
[103] Debole, F. and Sebastiani, F. (2003) Supervised term weighting for automated text categorization. SAC '03 Proceedings of the 2003 ACM symposium on Applied computing, pp. 784-788.
[104] Xue, D.J. and Sun, M.S. (2003) Chinese text categorization based on the binary weighting model with non-binary smoothing. ECIR'03 Proceedings of the 25th European conference on IR research, pp. 408-419.
[105] Heath, M.T. (1997) Scientific Computing: An Introductory Survey. New York: McGraw-Hill.
[106] Zeng, J., Cheung, W.K. and Liu, J.M. (2013) Learning Topic Models by Belief Propagation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35 (5), pp. 1121-1134.
[107] Wang, C., Thiesson, B., Meek, C. and Blei, D.M. (2009) Markov topic models. The Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 583-590.
[108] Casella, G. and George, E.I. (1992) Explaining the Gibbs sampler. The American Statistician, 46 (3), pp. 167-174.
[109] Bigi, B. (2003) Using Kullback-Leibler distance for text categorization. ECIR'03 Proceedings of the 25th European conference on IR research, pp. 305-319.
[110] Kullback, S. and Leibler, R.A. (1951) On Information and Sufficiency. The Annals of Mathematical Statistics, 22 (1), pp. 79-86.
[111] Heinrich, G. (2009) Parameter Estimation for Text Analysis. Technical Report No. 09RP008-FIGD, Fraunhofer Institute for Computer Graphics, pp. 1-32.
[112] Bertsekas, D.P. (1999) Nonlinear Programming. Cambridge: Athena Scientific.
[113] Bonnans, J.F. (2006) Numerical optimization: theoretical and practical aspects. New York: Springer Publishing.
[114] Kschischang, F.R., Frey, B.J. and Loeliger, H.A. (2001) Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47 (2), pp. 498-519.
[115] Neal, R.M. (1993) Probabilistic inference using Markov chain Monte Carlo methods. Toronto: University of Toronto.
[116] Seneta, E. (2006) Non-negative Matrices and Markov Chains. 2nd edn. New York: Springer Publishing.
[117] Newman, D., Asuncion, A., Smyth, P. and Welling, M. (2007) Distributed inference for latent Dirichlet allocation. In Proceedings of the Conference on Advances in Neural Information Processing Systems (NIPS 07), pp. 1081-1088.
[118] Mitchell, M. (1998) An Introduction to Genetic Algorithms. Cambridge: MIT Press.
[119] Goldberg, D.E. (1989) Genetic Algorithms in Search, Optimization and Machine Learning. 1st edn. Boston: Addison-Wesley.
[120] Affenzeller, M., Winkler, S., Wagner, S. and Beham, A. (2009) Genetic Algorithms and Genetic Programming: Modern Concepts and Practical Applications. 1st edn. USA: Chapman & Hall/CRC.
[121] Sivanandam, S.N. and Deepa, S.N. (2007) Introduction to Genetic Algorithms. 2nd edn. New York: Springer Publishing.
[122] Srinivas, M. and Patnaik, L.M. (1994) Genetic algorithms: a survey. Computer, 27 (6), pp. 17-26.
[123] Potts, J.C., Giddens, T.D. and Yadav, S.B. (1994) The development and evaluation of an improved genetic algorithm based on migration and artificial selection. IEEE Transactions on Systems, Man and Cybernetics, 24 (1), pp. 73-86.
[124] Burjorjee, K.M. (2013) Explaining optimization in genetic algorithms with uniform crossover. FOGA XII '13 Proceedings of the twelfth workshop on Foundations of genetic algorithms XII, pp. 37-50.
[125] Lobo, F.G. and Lima, C.F. (2005) A review of adaptive population sizing
schemes in genetic algorithms. GECCO '05 Proceedings of the 2005 workshops on Genetic and evolutionary computation, pp. 228-234.
[126] Migdalas, A., Pardalos, P.M. and Storøy, S. (2012) Parallel Computing in Optimization. Berlin: Kluwer Academic Publishers.
[127] Shokripour, A. and Othman, M. (2009) Survey on Divisible Load Theory and its Applications. International Conference on Information Management and Engineering (ICIME) 2009, pp. 300-304.
[128] Bharadwaj, V., Ghose, D. and Robertazzi, T. (2003) Divisible Load Theory: A New Paradigm for Load Scheduling in Distributed Systems. Cluster Computing, 6 (1), pp. 7-17.
[129] Robertazzi, T.G. (2003) Ten reasons to use divisible load theory. Computer, 36 (5), pp. 63-68.
[130] Li, X., Veeravalli, B. and Chi, C.K. (2001) Divisible load scheduling on a hypercube cluster with finite-size buffers and granularity constraints. First IEEE/ACM International Symposium on Cluster Computing and the Grid, pp. 660-667.
[131] Heien, E.M., Fujimoto, N. and Hagihara, K. (2008) Static Load Distribution for Communication Intensive Parallel Computing in Multiclusters. 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing, pp. 321-328.
[132] Noel, P.A. and Ganesan, S. (2010) Performance analysis of Divisible Load Scheduling utilizing Multi-Installment Load Distribution with varying sizes of result load fractions. 2010 IEEE International Conference on Electro/Information Technology (EIT), pp. 1-6.
[133] Ghatpande, A., Nakazato, H., Watanabe, H. and Beaumont, O. (2008) Divisible Load Scheduling with Result Collection on Heterogeneous Systems. IEEE International Symposium on Parallel and Distributed Processing (IPDPS) 2008, pp. 1-8.
[134] Sohn, J. and Robertazzi, T.G. (1998) Optimal time-varying load sharing for divisible loads. IEEE Transactions on Aerospace and Electronic Systems, 34 (3), pp. 907-923.
[135] Lin, X., Lu, Y., Deogun, J. and Goddard, S. (2007) Real-Time Divisible Load Scheduling for Cluster Computing. 13th IEEE Real Time and Embedded Technology and Applications Symposium (RTAS), pp. 303-314.
[136] Yang, Y., Zhang, J. and Kisiel, B.
(2003) A scalability analysis of classifiers in
text categorization. Proceedings of SIGIR-03, 26th ACM International Conference on Research and Development in Information Retrieval, pp. 96-103.
[137] Grover, C. et al. (2004) A Framework for Text Mining Services. School of Informatics, University of Edinburgh and National e-Science Centre, Edinburgh [Online]. Available at: http://www.ltg.ed.ac.uk/np/publications/ltg/papers/grover2004framework.pdf [Accessed 26 October 2011].
[138] Salton, G., Wong, A. and Yang, C.S. (1997) A vector space model for automatic indexing. Readings in information retrieval, pp. 273-280.
[139] Sheikholeslami, G., Chatterjee, S. and Zhang, A. (1998) WaveCluster: A multi-resolution clustering approach for very large spatial databases. Proceedings of the 24th VLDB Conference, pp. 428-439.
[140] Sun, C.K., Gao, B., Cao, Z.F. and Li, H. (2008) HTM: a topic model for hypertexts. EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 514-522.
[141] Joachims, T. (1999) Making large-scale support vector machine learning practical. In Advances in kernel methods: support vector learning, pp. 169-184.
[142] Tong, S. and Koller, D. (2002) Support vector machine active learning with applications to text classification. Machine Learning Research, 2 (3/1), pp. 45-66.
[143] Liu, T.Y. (2005) Support vector machines classification with a very large-scale taxonomy. ACM SIGKDD Explorations Newsletter - Natural language processing and text mining, 7 (1), pp. 36-43.
[144] Joachims, T. (1998) Text categorization with support vector machines: learning with many relevant features. Proceedings of ECML-98, 10th European Conference on Machine Learning, pp. 137-142.
[145] Joachims, T. (2001) A statistical learning model of text classification for support vector machines. SIGIR '01 Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 128-136.
[146] Tan, S.B. et al. (2005) A novel refinement approach for text categorization.
CIKM '05 Proceedings of the 14th ACM international conference on Information and knowledge management, pp. 469-476.
[147] Xue, X. and Zhou, Z. (2009) Distributional Features for Text Categorization.
IEEE Transactions on Knowledge and Data Engineering, 21 (3), pp. 428-442.
[148] Goncalves, T. and Quaresma, P. (2006) Evaluating preprocessing techniques in a Text Classification problem. pp. 841-850, [Online]. Available at: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.131.1271&rep=rep1&type=pdf [Accessed 26 October 2011].
[149] Srividhya, V. and Anitha, R. (2010) Evaluating Preprocessing Techniques in Text Categorization. International Journal of Computer Science and Application, Issue 2010, pp. 49-51.
[150] Schiefele, U. (1996) Topic Interest, Text Representation, and Quality of Experience. Contemporary Educational Psychology, 21 (1), pp. 3-18.
[151] Frank, S.L., Koppen, M., Noordman, L.G.M. and Vonk, W. (2007) Modeling Multiple Levels of Text Representation. Higher level language processes in the brain: inference and comprehension processes, pp. 133-157.
[152] Dumais, S.T., Platt, J., Heckerman, D. and Sahami, M. (1998) Inductive learning algorithms and representations for text categorization. Proceedings of CIKM-98, 7th ACM International Conference on Information and Knowledge Management, pp. 148-155.
[153] Yan, J. et al., (2006) Effective and Efficient Dimensionality Reduction for Large-Scale and Streaming Data Preprocessing. IEEE Transactions on Knowledge and Data Engineering, 18 (2), pp. 1-14.
[154] Conrad, J.G. and Utt, M.H. (1994) A system for discovering relationships by feature extraction from text databases. SIGIR '94 Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 260-270.
[155] Mooney, R.J. and Nahm, U.Y. (2003) Text Mining with Information Extraction. Proceedings of the 4th International MIDP Colloquium, pp. 141-160.
[156] Drucker, H., Vapnik, V. and Wu, D. (1999) Support vector machines for spam categorization. IEEE Transactions on Neural Networks, 10 (5), pp. 1048-1054.
[157] Lewis, D.D. (1995) Evaluating and optimizing autonomous text classification systems. SIGIR '95 Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval, pp.
246-254.
[158] Halkidi, M., Vazirgiannis, M. and Batistakis, Y. (2000) Quality Scheme Assessment in the Clustering Process. PKDD '00 Proceedings of the 4th European
Conference on Principles of Data Mining and Knowledge Discovery, pp. 265-276.
[159] Andrieu, C., Freitas, N.D., Doucet, A. and Jordan, M. (2003) An Introduction to MCMC for Machine Learning. Machine Learning, 50 (1-2), pp. 5-43.
[160] Chang, C.C. and Lin, C.J. (2011) LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2 (3), pp. 1-39.
[161] LIBSVM -- A Library for Support Vector Machines (2011) Available at: http://www.csie.ntu.edu.tw/~cjlin/libsvm/ [Accessed: 27 May 2011].
[162] Apte, C., Damerau, F. and Weiss, S. (1994) Automated learning of decision rules for text categorization. ACM Transactions on Information Systems, 12 (3), pp. 233-251.
[163] Jo, T. and Jo, G.S. (2008) Table Based Single Pass Algorithm for Clustering Electronic Documents in 20NewsGroups. IEEE International Workshop on Semantic Computing and Applications, pp. 66-71.
[164] MIT Computer Science and Artificial Intelligence Laboratory: Twenty news groups dataset (2011) Available at: http://qwone.com/~jason/20newsgroups/ [Accessed: 03 April 2011].
[165] McCallum, A., Mimno, D.M. and Wallach, H.M. (2009) Rethinking LDA: Why Priors Matter. Advances in Neural Information Processing Systems, pp. 1973-1981.
[166] Hsu, C.W. and Lin, C.J. (2002) A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13 (2), pp. 415-425.
[167] Li, W. and McCallum, A. (2006) Pachinko allocation: DAG-structured mixture models of topic correlations. ICML '06 Proceedings of the 23rd international conference on Machine learning, pp. 577-584.
[168] Yakhnenko, O. and Honavar, V. (2009) Multi-modal hierarchical Dirichlet process model for predicting image annotation and image-object label correspondence. Proceedings of the SIAM International Conference on Data Mining, pp. 281-294.
[169] Karypis, G., Han, E.H. and Kumar, V. (1999) Chameleon: hierarchical clustering using dynamic modeling. Computer, 32 (8), pp. 68-75.
[170] Qian, W.N., Gong, X.Q. and Zhou, A.Y. (2003) Clustering in very large databases based on distance and density.
Journal of Computer Science and Technology,
18 (1), pp. 67-76.
[171] Zhou, S.G. et al., (2000) Combining Sampling Technique with DBSCAN Algorithm for Clustering Large Spatial Databases. PADKK '00 Proceedings of the 4th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Current Issues and New Applications, pp. 169-172.
[172] Ester, M., Kriegel, H.P., Sander, J. and Xu, X.W. (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In Second International Conference on Knowledge Discovery and Data Mining, pp. 226-231.
[173] Ashour, W. and Sunoallah, S. (2011) Multi density DBSCAN. IDEAL'11 Proceedings of the 12th international conference on Intelligent data engineering and automated learning, pp. 446-453.
[174] Borah, B. and Bhattacharyya, D.K. (2004) An improved sampling-based DBSCAN for large spatial databases. Proceedings of International Conference on Intelligent Sensing and Information Processing, pp. 92-96.
[175] Duan, L. et al., (2007) A local-density based spatial clustering algorithm with noise. Information Systems, 32 (7), pp. 978-986.
[176] Van Kley, E.R. (1994) Data smoothing algorithm. Position Location and Navigation Symposium, pp. 323-328.
[177] Siebes, A. and Kersten, R. (2012) Smoothing categorical data. ECML PKDD'12 Proceedings of the 2012 European conference on Machine Learning and Knowledge Discovery in Databases, pp. 42-57.
[178] Shannon, C.E. (2001) A mathematical theory of communication. ACM SIGMOBILE Mobile Computing and Communications Review, 5 (1), pp. 3-55.
[179] Li, W.B., Sun, L., Feng, Y.Y. and Zhang, D.K. (2008) Smoothing LDA model for text categorization. AIRS'08 Proceedings of the 4th Asia information retrieval conference on Information retrieval technology, pp. 83-94.
[180] Nigam, K., Lafferty, J. and McCallum, A. (1999) Using Maximum Entropy for Text Classification. IJCAI-99 Workshop on Machine Learning for Information Filtering, pp. 61-67.
[181] Fu, L. and Hou, Y.X. (2009) Using non-extensive entropy for text classification. ICIC'09 Proceedings of the 5th international conference on Emerging intelligent computing technology and applications, pp. 908-917.
[182] Halkidi, M., Batistakis, Y. and Vazirgiannis, M. (2001) On Clustering Validation Techniques. Journal of Intelligent Information Systems, 17 (2-3), pp. 107-145.
[183] Huang, H., Bi, L.P., Song, H.T. and Lu, Y.C. (2005) A variational EM algorithm for large databases. Proceedings of 2005 International Conference on Machine Learning and Cybernetics, 5, pp. 3048-3052.
[184] Nallapati, R., Cohen, W. and Lafferty, J. (2007) Parallelized Variational EM for Latent Dirichlet Allocation: An Experimental Evaluation of Speed and Scalability. Seventh IEEE International Conference on Data Mining Workshops, pp. 349-354.
[185] Wolfe, J., Haghighi, A. and Klein, D. (2008) Fully distributed EM for very large datasets. ICML '08 Proceedings of the 25th international conference on Machine learning, pp. 1184-1191.
[186] Apache Mahout: Scalable machine learning and data mining (2012) Available at: http://mahout.apache.org/ [Accessed: 14 January 2012].
[187] Zhai, K., Boyd-Graber, J., Asadi, N. and Alkhouja, M.L. (2012) Mr. LDA: a flexible large scale topic modeling package using variational inference in MapReduce. Proceedings of the 21st international conference on World Wide Web, pp. 879-888.
[188] Smola, A. and Narayanamurthy, S. (2010) An architecture for parallel topic models. Proceedings of the VLDB Endowment, 3 (1-2), pp. 703-710.
[189] Yahoo! Hadoop Tutorial (2012) Available at: http://developer.yahoo.com/hadoop/tutorial/ [Accessed: 10 February 2012].
[190] Dou, W.W., Wang, X.Y., Chang, R. and Ribarsky, W. (2011) ParallelTopics: A probabilistic approach to exploring document collections. 2011 IEEE Conference on Visual Analytics Science and Technology (VAST), pp. 231-240.
[191] Zhang, D. et al., (2010) PTM: probabilistic topic mapping model for mining parallel document collections. CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management, pp. 1653-1656.
[192] Intel Threading Building Blocks (Intel TBB) (2012) Available at: https://www.threadingbuildingblocks.org/ [Accessed: 15 February 2012].
[193] The Internet Communications Engine (ICE) (2012) Available at: http://www.zeroc.com/ice.html [Accessed: 18 February 2012].
[194] Asuncion, A., Smyth, P. and Welling, M. (2008) Asynchronous distributed learning of topic models. In Proceedings of the Conference on Advances in Neural
Information Processing Systems, pp. 1-8.
[195] Wang, Y., Bai, H., Stanton, M., Chen, W. and Chang, E. (2009) PLDA: Parallel latent Dirichlet allocation for large-scale applications. In Algorithmic Aspects in Information and Management, pp. 301-314.
[196] Apache Hadoop (2012) Available at: http://hadoop.apache.org/ [Accessed: 28 January 2012].
[197] Guo, P.G., Wang, X.Z. and Han, Y.S. (2010) The enhanced genetic algorithms for the optimization design. 2010 3rd International Conference on Biomedical Engineering and Informatics (BMEI), 7, pp. 2990-2994.
[198] Jiang, W.J., Luo, D.T., Xu, Y.S. and Sun, X.M. (2004) Hybrid genetic algorithm research and its application in problem optimization. Fifth World Congress on Intelligent Control and Automation (WCICA) 2004, 3, pp. 2122-2126.
[199] Yazdi, H.M. (2013) Genetic Algorithms In Designing And Optimizing Of Structures. Saarbrücken: LAP LAMBERT Academic Publishing.
[200] Nadi, F. and Khader, A.T. (2011) A parameter-less genetic algorithm with customized crossover and mutation operators. GECCO '11 Proceedings of the 13th annual conference on Genetic and evolutionary computation, pp. 901-908.
[201] Vafaee, F., Turán, G. and Nelson, P.C. (2010) Optimizing genetic operator rates using a Markov chain model of genetic algorithms. GECCO '10 Proceedings of the 12th annual conference on Genetic and evolutionary computation, pp. 721-728.
[202] Hammoud, S., Li, M.Z., Liu, Y. and Alham, N.K. (2010) MRSim: A discrete event based MapReduce simulator. 2010 Seventh International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), 6, pp. 2993-2997.
[203] Liu, Y., Li, M.Z., Alham, N.K. and Hammoud, S. (2013) HSim: A MapReduce simulator in enabling Cloud Computing. Future Generation Computer Systems, 29 (1), pp. 300-308.
[204] Kumar, P. and Verma, A. (2012) Scheduling using improved genetic algorithm in cloud computing for independent tasks. ICACCI '12 Proceedings of the International Conference on Advances in Computing, Communications and Informatics, pp. 137-142.
[205] Adamuthe, A.C. and Bichkar, R.S.
(2011) Minimizing job completion time in grid scheduling with resource and timing constraints using genetic algorithm. ICWET '11 Proceedings of the International Conference & Workshop on Emerging Trends in
Technology, pp. 338-343.
[206] Ramirez, A.J., Knoester, D.B., Cheng, B.H.C. and McKinley, P.K. (2009) Applying genetic algorithms to decision making in autonomic computing systems. ICAC '09 Proceedings of the 6th international conference on Autonomic computing, pp. 97-106.
[207] Nigam, K., McCallum, A.K., Thrun, S. and Mitchell, T.M. (2000) Text classification from labeled and unlabeled documents using EM. Machine Learning, 39 (2/3), pp. 103-134.
[208] Blei, D.M. and McAuliffe, J.D. (2010) Supervised topic models. arXiv preprint arXiv:1003.0783, pp. 1-22.
[209] Perotte, A.J., Wood, F., Elhadad, N. and Bartlett, N. (2011) Hierarchically Supervised Latent Dirichlet Allocation. Advances in Neural Information Processing Systems, pp. 2609-2617.
[210] Zhu, J., Ahmed, A. and Xing, E.P. (2012) MedLDA: maximum margin supervised topic models. The Journal of Machine Learning Research, 13 (1), pp. 2237-2278.
[211] Rosen-Zvi, M., Griffiths, T., Steyvers, M. and Smyth, P. (2004) The author-topic model for authors and documents. UAI '04 Proceedings of the 20th conference on Uncertainty in artificial intelligence, pp. 487-494.
[212] Rosen-Zvi, M. et al., (2010) Learning author-topic models from text corpora. ACM Transactions on Information Systems (TOIS), 28 (1), pp. 401-438.
[213] Steyvers, M., Smyth, P., Rosen-Zvi, M. and Griffiths, T. (2004) Probabilistic author-topic models for information discovery. Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 306-315.
[214] Ramage, D., Hall, D., Nallapati, R. and Manning, C.D. (2009) Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora. EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, 1 (1), pp. 248-256.
[215] Wang, T., Yin, G., Li, X. and Wang, H.M. (2012) Labeled topic detection of open source software from mining mass textual project profiles. SoftwareMining '12 Proceedings of the First International Workshop on Software Mining, pp. 17-24.
[216] Liu, Z.Y., Zhang, Y.Z., Chang, E.Y. and Sun, M.S.
(2011) PLDA+: Parallel
latent Dirichlet allocation with data placement and pipeline processing. ACM Transactions on Intelligent Systems and Technology (TIST), 2 (3), pp. 2601-2618.
[217] Chang, E.Y., Bai, H.J. and Zhu, K.H. (2009) Parallel algorithms for mining large-scale rich-media data. MM '09 Proceedings of the 17th ACM international conference on Multimedia, pp. 917-918.
[218] Shu, W. and Kale, L.V. (1989) A dynamic scheduling strategy for the Chare-Kernel system. Proceedings of the 1989 ACM/IEEE Conference on Supercomputing, pp. 389-398.
[219] Viswanathan, S., Veeravalli, B., Yu, D.T. and Robertazzi, T.G. (2004) Design and analysis of a dynamic scheduling strategy with resource estimation for large-scale grid systems. Fifth IEEE/ACM International Workshop on Grid Computing, pp. 163-170.
[220] Cariño, R.L. and Banicescu, I. (2008) Dynamic load balancing with adaptive factoring methods in scientific applications. The Journal of Supercomputing, 44 (1), pp. 41-63.
Rotaru, C. and Brudaru, O. (2012) Dynamic segregative genetic algorithm for optimizing the variable ordering of ROBDDs. GECCO '12 Proceedings of the fourteenth international conference on Genetic and evolutionary computation conference, pp. 657-664.
[221] Kameda, H., Fathy, E.Z.S., Ryu, I. and Li, J. (2000) A performance comparison of dynamic vs. static load balancing policies in a mainframe-personal computer network model. Proceedings of the 39th IEEE Conference on Decision and Control, 2, pp. 1415-1420.
[222] Srinivas, M. and Patnaik, L.M. (1994) Adaptive probabilities of crossover and mutation in genetic algorithms. IEEE Transactions on Systems, Man and Cybernetics, 24 (4), pp. 656-667.
[223] Yang, S.X. (2008) Genetic algorithms with memory- and elitism-based immigrants in dynamic environments. Evolutionary Computation, 16 (3), pp. 385-416.
[224] Milton, J. and Kennedy, P.J. (2010) Static and dynamic selection thresholds governing the accumulation of information in genetic algorithms using ranked populations. Evolutionary Computation, 18 (2), pp. 229-254.
[225] Peng, J.J. et al., (2009) Comparison of Several Cloud Computing Platforms.
2009 Second International Symposium on Information Science and Engineering (ISISE), pp. 23-27.
[226] Dikaiakos, M.D. et al., (2009) Cloud Computing: Distributed Internet Computing for IT and Scientific Research. IEEE Internet Computing, 13 (5), pp. 10-13.
[227] Buyya, R. et al., (2009) Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility. Future Generation Computer Systems, 25 (6), pp. 599-616.
[228] Wang, X.G. and Grimson, E. (2007) Spatial Latent Dirichlet Allocation. Advances in NIPS, 20, pp. 1577-1584, [Online]. Available at: http://machinelearning.wustl.edu/mlpapers/paper_files/nips2007_102.pdf [Accessed 14 March 2013].
[229] Fei-Fei, L. and Perona, P. (2005) A Bayesian hierarchical model for learning natural scene categories. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) 2005, 2, pp. 524-531.
[230] Vaduva, C., Gavat, I. and Datcu, M. (2013) Latent Dirichlet Allocation for Spatial Analysis of Satellite Images. IEEE Transactions on Geoscience and Remote Sensing, 51 (5), pp. 2770-2786.