# Beyond Myopic Inference in Big Data Pipelines

Save this PDF as:

Size: px
Start display at page:

## Transcription

3 X Z 1 Z 2 Y θ 1 θ 2 θ 3 θ n Figure 5: Schematic of a general pipeline. Going beond the composition of two components, an general pipeline can be viewed as a Directed Acclic Graph, where the nodes are the black box components and the edges denote which outputs are piped as inputs to which components. The basic idea illustrated above for a linear twostep pipeline can be extended to general pipelines such as the one illustrated in Figure 5. An general pipeline structured as a DAG admits a topological ordering. Denoting the input to the pipeline as X, eventual output Y, and the output of each intermediate stage i in this topological ordering as Z i, then an acclic pipeline is a subgraph of the graph shown in Figure 5. From our modeling assumption, this DAG is an equivalent representation of a directed graphical model: each of X, Z 1, Z 2... Y are nodes denoting random variables, and the components enforce conditional independences. The graphical model corresponding to Figure 5 makes the fewest independence assumptions across stages. Analogous to the two stage pipeline from above, the Baes Optimal prediction Baes for general pipelines is Baes = argmax ( z 3 z 1 Pr(z 1 x, θ 1) z 2 Pr(z2 x, z 1, θ 2) ) Pr(z 3 x,z 1,z 2,θ 3)...Pr( x,z 1...z n 1,θ n)... ]] (3) Computing the Baes Optimal Baes is infeasible for realworld pipelines because it requires summing up over all possible intermediate output combinations z 1,... z n 1. We will therefore explore in the following section how this intractable sum can be approximated efficientl using the information that the Top-k API provides. 2.2 Canonical Inference Even the naive model of passing onl the top output to the next stage can be viewed as a crude approximation of the Baes Optimal output. We call this procedure the Canonical Inference algorithm. For simplicit, consider again the two component pipeline, such that the naive model results in z 1 (1) = argmax z 1 Pr(z 1 x, θ 1) Can = argmax Pr( z (1) 1, θ 2) (4) The optimization over in (4) is unaffected if it is multiplied b the term Pr(z 1 (1) x, θ 1). Once the summation in the marginal in (1) is restricted to z 1 (1), the sum reduces to Pr( z 1 (1), θ 2) Pr(z 1 (1) x, θ 1) Pr( x). Hence, this objective provides a lower bound on the true Pr( x). It is eas to see that this holds true for general pipeline architectures too, where the canonical inference procedure proceeds stage-b-stage through the topological ordering. For i = 1,..., n, define: z i (1) = argmax z i Pr(z i x, z 1 (1),..., z i 1 (1), θ i) Can = argmax Pr( x, z (1) 1,..., z (1) n 1, θ n) (5) then Pr( z 1 (1),...,z n 1 (1),θ n)... Pr(z 1 (1) x,θ 1) Pr( x). In this sense, canonical inference is optimizing a crude lower bound to the Baes Optimal objective (again, the terms Pr(z 1 (1) x, θ 1) etc. do not affect the optimization over ). This suggests how inference can be improved, namel b computing improved lower bounds on Pr( x). This is the direction we take in Section 3, where we exploit the Top-k API and explore several was for making the computation of this bound more efficient. 3. EFFICIENT INFERENCE IN PIPELINES Canonical inference and full Baesian inference are two extremes of the efficienc vs. accurac tradeoff. We now explore how to strike a balance between the two when pipeline components expose the Top-k API. In particular, given input x suppose each component θ is ] capable of producing a list of k outputs (1), (2),... (k) along with their conditional ] probabilities Pr( (1) x, θ),... Pr( (k) x, θ) (in decreasing order). We now outline three algorithms, characterize their computational efficienc, and describe their implementation in terms of relational algebra. This interpretation enables direct implementation of these algorithms on top of traditional/probabilistic database sstems. 3.1 Top-k Inference The natural approach to approximating the full Baesian marginal (Eq. 3) when computation is limited to a few samples is to choose the samples as the top-k realizations of the random variable, weighted b their conditional probabilities. We call this Top-k Inference, which proceeds in an analogous fashion to canonical inference, stage-b-stage through the topological ordering. However, inference now becomes more involved. Each component produces a scored list, and each item in the lists generated b predecessors will have a corresponding scored list as output. Denote the list of outputs returned b the component θ i (each of which are possible values for the random variable Z i in the equivalent graphical model) as {z i} k. Note that this set depends on the inputs to θ i, but we omit this dependence from the notation to reduce clutter. This gives rise to the following optimization problem for identifing the eventual output of the pipeline. TopK = argmax {z 3 } k {z 1 } k Pr(z 1 x, θ 1) {z 2 } k Pr(z2 x, z 1, θ 2) ( ) Pr(z 3 x,z 1,z 2,θ 3)...Pr( x,z 1,...z n 1,θ n)... ]] (6) A simple extension of top-k inference is to have a parameter k i for each component θ i that allows variable length lists of outputs {z i} ki for different components in the pipeline. We refer to these parameters collectivel as k. Consider again the simple two stage pipeline in Figure 4. Each component now operates on relational tables and generates relational tables. We initialize the table T 0 with the input to the pipeline x:

4 IX 0 V ALUE 0 SCORE 0 1 x 1 The subscripts in the attributes refers to the component ID, IX denotes the index into the scored list of outputs produced b the component, VALUE is the actual output and SCORE is its conditional score. Now, given a parameter vector k = (k 1, k 2), the first component will generate the table T 1. IX 0 IX 1 V ALUE 1 SCORE z 1 (1) s x,1 (z 1 (1) ) 1 2 z 1 (2) s x,1 (z 1 (2) ) k 1 z 1 (k 1 ) s x,1 (z 1 (k 1 ) ) Note that IX 0 carries no additional information, if component 1 is the onl successor to component 0 (as is the case in this example). However, in a general pipeline, we need columns like IX 0 to uniquel identif inputs to a component. The second component now takes the table T 1 and produces the final table T 2. IX 0 IX 1 IX 2 V ALUE 2 SCORE z 2 (1,1) s z1 (1),2 (z 2 (1,1) )s x,1 (z 1 (1) ) z 2 (1,2) s(z 2 (1,2) )s(z 1 (1) ) k 2 z 2 (1,k 2 ) s(z 2 (1,k 2 ) )s(z 1 (1) ) z 2 (2,1) s(z 2 (2,1) )s(z 1 (2) ) k 1 k 2 z 2 (k 1,k 2 ) s(z 2 (k 1,k 2 ) )s(z 1 (k 1 ) ) The final output for top-k inference TopK is generated b grouping the values T 2.VALUE 2, aggregating b SCORE 2 and selecting the element with largest sum. For linear pipelines (Figure 5 without the curved edge dependencies), this procedure is summarized in Algorithm 1. The linear pipeline has the nice propert that each component has exactl one successor, so we need not store additional information to trace the provenance of each intermediate output. We assume each component provides the interface GenTopK(ID, k, INPUT) which returns a table with attributes VALUE ID, SCORE ID that has k tuples. For the n-stage linear pipeline, performing top-k inference with parameters k will ield a final table T n with at most k 1k 2... k n tuples. There ma be fewer tuples, since an output generated b different combinations of intermediate outputs results in onl one tuple. For top-k inference in general pipelines, the algorithm needs to maintain additional information to uniquel identif each input that a component processes. Specificall, for a component θ i, let the indices of its parents in the DAG be denoted b J. We need to construct a table that θ i can process, which has all compatible combinations of outputs in tables {T j} j J. This procedure is outlined in Algorithm 2. Let the lowest common ancestor of two parents of θ i (sa, θ p and θ q) be denoted b θ m. If θ p and θ q do not share an ancestor, m defaults to 0. The onl compatible output combinations between θ p and θ q are those that were generated from the same output from θ m. We store columns IX in our tables to identif precisel this fact: T i traces all ancestors of the component θ i. Hence, the natural join of tables T p and T q will match on the IX attributes corresponding to Algorithm 1 Top-k Inference for Linear Pipelines. Require: GenT opk(id, k, x), k T 0 CreateTable(V ALUE 0:X ; SCORE 0:1) for i = 1 to n do T i φ for t in T i 1 do x t.v ALUE i 1 s t.score i 1 OP GenT opk(i, k i, x) OP.SCORE i s OP.SCORE i T i T i OP end for if {Successors(i)} 1 then T i Sum(T i.score i, GroupB(V ALUE i)) end if T i T ransform(t i) {Section 3.2, 3.3} end for return Max(T n.score n) Algorithm 2 Creating Compatible Inputs. Require: J 1: p max J 2: T T p 3: A J \ p 4: repeat 5: q max A 6: r LowestCommonAncestor(p, q) 7: T T T q 8: T.SCORE r T.SCORE p Tq.SCOREq T r.score r 9: A A {r} \ {p, q} 10: p r 11: until A 1 12: return T their common ancestors up to m, as shown in lines 6-7 in Algorithm 2. Furthermore, we need to update the probabilit mass for each compatible output combination. From (6), we see that the product of the scores in T p and T q will have the contribution from their common ancestors double-counted. Hence, in line 8 of the algorithm, we divide out this score. With this algorithm, and the understanding that Gen- TopK(ID, k, INPUT) augments its returned table with IX ID information, inference proceeds exactl as the algorithm for the linear pipeline (Algorithm 1) with T i 1 replaced b the output of Algorithm 2 given Parents(i) as input. It should now be clear wh we alwas check the Successors(i) before aggregating tuples in a table: Such an aggregation step loses provenance information (in the form of IX attributes) which is necessar for correct recombination later in the pipeline. 3.2 Fixed Beam Inference The top-k inference procedure described above follows Equation (6) exactl. However, this can result in a multiplicative increase in the sizes of the intermediate tables, with the eventual output requiring aggregation over potentiall k 1k 2... k n outputs. This ma become infeasible for even moderate pipeline sizes and modest choices for k. One wa to control this blowup is to interpret the parameters k as limits on the sizes of the intermediate tables. In particular, at each stage i of the pipeline, we sort table T i

### Client Based Power Iteration Clustering Algorithm to Reduce Dimensionality in Big Data

Client Based Power Iteration Clustering Algorithm to Reduce Dimensionalit in Big Data Jaalatchum. D 1, Thambidurai. P 1, Department of CSE, PKIET, Karaikal, India Abstract - Clustering is a group of objects

### Natural Language to Relational Query by Using Parsing Compiler

Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,

### Guest Editors Introduction: Machine Learning in Speech and Language Technologies

Guest Editors Introduction: Machine Learning in Speech and Language Technologies Pascale Fung (pascale@ee.ust.hk) Department of Electrical and Electronic Engineering Hong Kong University of Science and

### K-Means Cluster Analysis. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1

K-Means Cluster Analsis Chapter 3 PPDM Class Tan,Steinbach, Kumar Introduction to Data Mining 4/18/4 1 What is Cluster Analsis? Finding groups of objects such that the objects in a group will be similar

### Building a Question Classifier for a TREC-Style Question Answering System

Building a Question Classifier for a TREC-Style Question Answering System Richard May & Ari Steinberg Topic: Question Classification We define Question Classification (QC) here to be the task that, given

### Distributed Structured Prediction for Big Data

Distributed Structured Prediction for Big Data A. G. Schwing ETH Zurich aschwing@inf.ethz.ch T. Hazan TTI Chicago M. Pollefeys ETH Zurich R. Urtasun TTI Chicago Abstract The biggest limitations of learning

### Example: Document Clustering. Clustering: Definition. Notion of a Cluster can be Ambiguous. Types of Clusterings. Hierarchical Clustering

Overview Prognostic Models and Data Mining in Medicine, part I Cluster Analsis What is Cluster Analsis? K-Means Clustering Hierarchical Clustering Cluster Validit Eample: Microarra data analsis 6 Summar

### Open Domain Information Extraction. Günter Neumann, DFKI, 2012

Open Domain Information Extraction Günter Neumann, DFKI, 2012 Improving TextRunner Wu and Weld (2010) Open Information Extraction using Wikipedia, ACL 2010 Fader et al. (2011) Identifying Relations for

### Towards SoMEST Combining Social Media Monitoring with Event Extraction and Timeline Analysis

Towards SoMEST Combining Social Media Monitoring with Event Extraction and Timeline Analysis Yue Dai, Ernest Arendarenko, Tuomo Kakkonen, Ding Liao School of Computing University of Eastern Finland {yvedai,

### Chapter 8. Final Results on Dutch Senseval-2 Test Data

Chapter 8 Final Results on Dutch Senseval-2 Test Data The general idea of testing is to assess how well a given model works and that can only be done properly on data that has not been seen before. Supervised

### Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining b Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining /8/ What is Cluster

### Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov

Search and Data Mining: Techniques Text Mining Anya Yarygina Boris Novikov Introduction Generally used to denote any system that analyzes large quantities of natural language text and detects lexical or

### Learning and Inference over Constrained Output

IJCAI 05 Learning and Inference over Constrained Output Vasin Punyakanok Dan Roth Wen-tau Yih Dav Zimak Department of Computer Science University of Illinois at Urbana-Champaign {punyakan, danr, yih, davzimak}@uiuc.edu

### Course: Model, Learning, and Inference: Lecture 5

Course: Model, Learning, and Inference: Lecture 5 Alan Yuille Department of Statistics, UCLA Los Angeles, CA 90095 yuille@stat.ucla.edu Abstract Probability distributions on structured representation.

### Data Mining Cluster Analysis: Basic Concepts and Algorithms. Clustering Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining b Tan, Steinbach, Kumar Clustering Algorithms K-means and its variants Hierarchical clustering

### CENG 734 Advanced Topics in Bioinformatics

CENG 734 Advanced Topics in Bioinformatics Week 9 Text Mining for Bioinformatics: BioCreative II.5 Fall 2010-2011 Quiz #7 1. Draw the decompressed graph for the following graph summary 2. Describe the

### Shallow Parsing with Apache UIMA

Shallow Parsing with Apache UIMA Graham Wilcock University of Helsinki Finland graham.wilcock@helsinki.fi Abstract Apache UIMA (Unstructured Information Management Architecture) is a framework for linguistic

### Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining b Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/8/4 What is

### An NLP Curator (or: How I Learned to Stop Worrying and Love NLP Pipelines)

An NLP Curator (or: How I Learned to Stop Worrying and Love NLP Pipelines) James Clarke, Vivek Srikumar, Mark Sammons, Dan Roth Department of Computer Science, University of Illinois, Urbana-Champaign.

### POSBIOTM-NER: A Machine Learning Approach for. Bio-Named Entity Recognition

POSBIOTM-NER: A Machine Learning Approach for Bio-Named Entity Recognition Yu Song, Eunji Yi, Eunju Kim, Gary Geunbae Lee, Department of CSE, POSTECH, Pohang, Korea 790-784 Soo-Jun Park Bioinformatics

### DEPENDENCY PARSING JOAKIM NIVRE

DEPENDENCY PARSING JOAKIM NIVRE Contents 1. Dependency Trees 1 2. Arc-Factored Models 3 3. Online Learning 3 4. Eisner s Algorithm 4 5. Spanning Tree Parsing 6 References 7 A dependency parser analyzes

### Analyzing the Dynamics of Research by Extracting Key Aspects of Scientific Papers

Analyzing the Dynamics of Research by Extracting Key Aspects of Scientific Papers Sonal Gupta Christopher Manning Natural Language Processing Group Department of Computer Science Stanford University Columbia

### Learning from Diversity

Learning from Diversity Epitope Prediction with Sequence and Structure Features using an Ensemble of Support Vector Machines Rob Patro and Carl Kingsford Center for Bioinformatics and Computational Biology

### E6895 Big Data Analytics: Lecture 12. Mobile Vision. Ching-Yung Lin, Ph.D. IBM Chief Scientist, Graph Computing. December 1st, 2016

E6895 Big Data Analytics: Lecture 12 Mobile Vision Ching-Yung Lin, Ph.D. IBM Chief Scientist, Graph Computing Adjunct Professor, Dept. of Electrical Engineering and Computer Science December 1st, 2016

### KEYWORD SEARCH IN RELATIONAL DATABASES

KEYWORD SEARCH IN RELATIONAL DATABASES N.Divya Bharathi 1 1 PG Scholar, Department of Computer Science and Engineering, ABSTRACT Adhiyamaan College of Engineering, Hosur, (India). Data mining refers to

### A Systematic Cross-Comparison of Sequence Classifiers

A Systematic Cross-Comparison of Sequence Classifiers Binyamin Rozenfeld, Ronen Feldman, Moshe Fresko Bar-Ilan University, Computer Science Department, Israel grurgrur@gmail.com, feldman@cs.biu.ac.il,

### SVM Based Learning System For Information Extraction

SVM Based Learning System For Information Extraction Yaoyong Li, Kalina Bontcheva, and Hamish Cunningham Department of Computer Science, The University of Sheffield, Sheffield, S1 4DP, UK {yaoyong,kalina,hamish}@dcs.shef.ac.uk

### Statistical Machine Translation: IBM Models 1 and 2

Statistical Machine Translation: IBM Models 1 and 2 Michael Collins 1 Introduction The next few lectures of the course will be focused on machine translation, and in particular on statistical machine translation

### A Knowledge-Poor Approach to BioCreative V DNER and CID Tasks

A Knowledge-Poor Approach to BioCreative V DNER and CID Tasks Firoj Alam 1, Anna Corazza 2, Alberto Lavelli 3, and Roberto Zanoli 3 1 Dept. of Information Eng. and Computer Science, University of Trento,

### II. RELATED WORK. Sentiment Mining

Sentiment Mining Using Ensemble Classification Models Matthew Whitehead and Larry Yaeger Indiana University School of Informatics 901 E. 10th St. Bloomington, IN 47408 {mewhiteh, larryy}@indiana.edu Abstract

### Research Statement Immanuel Trummer www.itrummer.org

Research Statement Immanuel Trummer www.itrummer.org We are collecting data at unprecedented rates. This data contains valuable insights, but we need complex analytics to extract them. My research focuses

### Making Sense of the Mayhem: Machine Learning and March Madness

Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University atran3@stanford.edu ginzberg@stanford.edu I. Introduction III. Model The goal of our research

### ILLINOIS MATH SOLVER: Math Reasoning on the Web

ILLINOIS MATH SOLVER: Math Reasoning on the Web Subhro Roy and Dan Roth University of Illinois, Urbana Champaign {sroy9, danr}@illinois.edu Abstract There has been a recent interest in understanding text

### Adaptive Tolerance Algorithm for Distributed Top-K Monitoring with Bandwidth Constraints

Adaptive Tolerance Algorithm for Distributed Top-K Monitoring with Bandwidth Constraints Michael Bauer, Srinivasan Ravichandran University of Wisconsin-Madison Department of Computer Sciences {bauer, srini}@cs.wisc.edu

### Distributed forests for MapReduce-based machine learning

Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication

### A Framework-based Online Question Answering System. Oliver Scheuer, Dan Shen, Dietrich Klakow

A Framework-based Online Question Answering System Oliver Scheuer, Dan Shen, Dietrich Klakow Outline General Structure for Online QA System Problems in General Structure Framework-based Online QA system

### Università degli Studi di Bologna

Università degli Studi di Bologna DEIS Biometric System Laboratory Incremental Learning by Message Passing in Hierarchical Temporal Memory Davide Maltoni Biometric System Laboratory DEIS - University of

### Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

### SIPAC. Signals and Data Identification, Processing, Analysis, and Classification

SIPAC Signals and Data Identification, Processing, Analysis, and Classification Framework for Mass Data Processing with Modules for Data Storage, Production and Configuration SIPAC key features SIPAC is

### English Grammar Checker

International l Journal of Computer Sciences and Engineering Open Access Review Paper Volume-4, Issue-3 E-ISSN: 2347-2693 English Grammar Checker Pratik Ghosalkar 1*, Sarvesh Malagi 2, Vatsal Nagda 3,

### Mavuno: A Scalable and Effective Hadoop-Based Paraphrase Acquisition System

Mavuno: A Scalable and Effective Hadoop-Based Paraphrase Acquisition System Donald Metzler and Eduard Hovy Information Sciences Institute University of Southern California Overview Mavuno Paraphrases 101

### SERVO CONTROL SYSTEMS 1: DC Servomechanisms

Servo Control Sstems : DC Servomechanisms SERVO CONTROL SYSTEMS : DC Servomechanisms Elke Laubwald: Visiting Consultant, control sstems principles.co.uk ABSTRACT: This is one of a series of white papers

### Compact Representations and Approximations for Compuation in Games

Compact Representations and Approximations for Compuation in Games Kevin Swersky April 23, 2008 Abstract Compact representations have recently been developed as a way of both encoding the strategic interactions

### Online Active Learning Methods for Fast Label-Efficient Spam Filtering

Online Active Learning Methods for Fast Label-Efficient Spam Filtering D. Sculley Department of Computer Science Tufts University, Medford, MA USA dsculley@cs.tufts.edu ABSTRACT Active learning methods

### Monotonicity Hints. Abstract

Monotonicity Hints Joseph Sill Computation and Neural Systems program California Institute of Technology email: joe@cs.caltech.edu Yaser S. Abu-Mostafa EE and CS Deptartments California Institute of Technology

### Concept Learning. Machine Learning 1

Concept Learning Inducing general functions from specific training examples is a main issue of machine learning. Concept Learning: Acquiring the definition of a general category from given sample positive

### large-scale machine learning revisited Léon Bottou Microsoft Research (NYC)

large-scale machine learning revisited Léon Bottou Microsoft Research (NYC) 1 three frequent ideas in machine learning. independent and identically distributed data This experimental paradigm has driven

### Automatic Web Page Classification

Automatic Web Page Classification Yasser Ganjisaffar 84802416 yganjisa@uci.edu 1 Introduction To facilitate user browsing of Web, some websites such as Yahoo! (http://dir.yahoo.com) and Open Directory

### Steven C.H. Hoi School of Information Systems Singapore Management University Email: chhoi@smu.edu.sg

Steven C.H. Hoi School of Information Systems Singapore Management University Email: chhoi@smu.edu.sg Introduction http://stevenhoi.org/ Finance Recommender Systems Cyber Security Machine Learning Visual

### Semantic parsing with Structured SVM Ensemble Classification Models

Semantic parsing with Structured SVM Ensemble Classification Models Le-Minh Nguyen, Akira Shimazu, and Xuan-Hieu Phan Japan Advanced Institute of Science and Technology (JAIST) Asahidai 1-1, Nomi, Ishikawa,

### Structured Models for Fine-to-Coarse Sentiment Analysis

Structured Models for Fine-to-Coarse Sentiment Analysis Ryan McDonald Kerry Hannan Tyler Neylon Mike Wells Jeff Reynar Google, Inc. 76 Ninth Avenue New York, NY 10011 Contact email: ryanmcd@google.com

### Data Mining Clustering. Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

Data Mining Clustering Toon Calders Sheets are based on the those provided b Tan, Steinbach, and Kumar. Introduction to Data Mining What is Cluster Analsis? Finding groups of objects such that the objects

### Procedia Computer Science 00 (2012) 1 21. Trieu Minh Nhut Le, Jinli Cao, and Zhen He. trieule@sgu.edu.vn, j.cao@latrobe.edu.au, z.he@latrobe.edu.

Procedia Computer Science 00 (2012) 1 21 Procedia Computer Science Top-k best probability queries and semantics ranking properties on probabilistic databases Trieu Minh Nhut Le, Jinli Cao, and Zhen He

### Branch-and-Price Approach to the Vehicle Routing Problem with Time Windows

TECHNISCHE UNIVERSITEIT EINDHOVEN Branch-and-Price Approach to the Vehicle Routing Problem with Time Windows Lloyd A. Fasting May 2014 Supervisors: dr. M. Firat dr.ir. M.A.A. Boon J. van Twist MSc. Contents

### SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY

SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY G.Evangelin Jenifer #1, Mrs.J.Jaya Sherin *2 # PG Scholar, Department of Electronics and Communication Engineering(Communication and Networking), CSI Institute

### Social Media Mining. Data Mining Essentials

Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

### The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2

2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2 1 School of

### An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

### FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT

### SYSM 6304: Risk and Decision Analysis Lecture 5: Methods of Risk Analysis

SYSM 6304: Risk and Decision Analysis Lecture 5: Methods of Risk Analysis M. Vidyasagar Cecil & Ida Green Chair The University of Texas at Dallas Email: M.Vidyasagar@utdallas.edu October 17, 2015 Outline

### Regression Using Support Vector Machines: Basic Foundations

Regression Using Support Vector Machines: Basic Foundations Technical Report December 2004 Aly Farag and Refaat M Mohamed Computer Vision and Image Processing Laboratory Electrical and Computer Engineering

### Graph Mining and Social Network Analysis

Graph Mining and Social Network Analysis Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann

### Categorical Data Visualization and Clustering Using Subjective Factors

Categorical Data Visualization and Clustering Using Subjective Factors Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,

### A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier

A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier G.T. Prasanna Kumari Associate Professor, Dept of Computer Science and Engineering, Gokula Krishna College of Engg, Sullurpet-524121,

### Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk

Introduction to Machine Learning and Data Mining Prof. Dr. Igor Trakovski trakovski@nyus.edu.mk Neural Networks 2 Neural Networks Analogy to biological neural systems, the most robust learning systems

### Parallel Data Mining. Team 2 Flash Coders Team Research Investigation Presentation 2. Foundations of Parallel Computing Oct 2014

Parallel Data Mining Team 2 Flash Coders Team Research Investigation Presentation 2 Foundations of Parallel Computing Oct 2014 Agenda Overview of topic Analysis of research papers Software design Overview

### Tekniker för storskalig parsning

Tekniker för storskalig parsning Diskriminativa modeller Joakim Nivre Uppsala Universitet Institutionen för lingvistik och filologi joakim.nivre@lingfil.uu.se Tekniker för storskalig parsning 1(19) Generative

### 6.2.8 Neural networks for data mining

6.2.8 Neural networks for data mining Walter Kosters 1 In many application areas neural networks are known to be valuable tools. This also holds for data mining. In this chapter we discuss the use of neural

### Machine Learning. Mark K Cowan. x 1. x

Machine Learning Mark K Cowan +3 +3 +3 +3 This book is based partl on content from the 2013 session of the on-line Machine Learning course run b Andrew Ng (Stanford Universit). The on-line course is provided

### RNN for Sentiment Analysis: Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank

RNN for Sentiment Analysis: Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank Borui(Athena) Ye University of Waterloo borui.ye@uwaterloo.ca July 15, 2015 1 / 26 Overview 1 Introduction

### Make Better Decisions with Optimization

ABSTRACT Paper SAS1785-2015 Make Better Decisions with Optimization David R. Duling, SAS Institute Inc. Automated decision making systems are now found everywhere, from your bank to your government to

### Moral Hazard. Itay Goldstein. Wharton School, University of Pennsylvania

Moral Hazard Itay Goldstein Wharton School, University of Pennsylvania 1 Principal-Agent Problem Basic problem in corporate finance: separation of ownership and control: o The owners of the firm are typically

### Machine Learning Final Project Spam Email Filtering

Machine Learning Final Project Spam Email Filtering March 2013 Shahar Yifrah Guy Lev Table of Content 1. OVERVIEW... 3 2. DATASET... 3 2.1 SOURCE... 3 2.2 CREATION OF TRAINING AND TEST SETS... 4 2.3 FEATURE

### Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

### End-to-End Sentiment Analysis of Twitter Data

End-to-End Sentiment Analysis of Twitter Data Apoor v Agarwal 1 Jasneet Singh Sabharwal 2 (1) Columbia University, NY, U.S.A. (2) Guru Gobind Singh Indraprastha University, New Delhi, India apoorv@cs.columbia.edu,

### Search and Information Retrieval

Search and Information Retrieval Search on the Web 1 is a daily activity for many people throughout the world Search and communication are most popular uses of the computer Applications involving search

### An Energy Efficient Location Service for Mobile Ad Hoc Networks

An Energ Efficient Location Service for Mobile Ad Hoc Networks Zijian Wang 1, Euphan Bulut 1 and Boleslaw K. Szmanski 1, 1 Department of Computer Science, Rensselaer Poltechnic Institute, Tro, NY 12180

### Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

### Knowledge Discovery from patents using KMX Text Analytics

Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs anton.heijs@treparel.com Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers

### Lecture 11: Graphical Models for Inference

Lecture 11: Graphical Models for Inference So far we have seen two graphical models that are used for inference - the Bayesian network and the Join tree. These two both represent the same joint probability

### INVESTIGATIONS AND FUNCTIONS 1.1.1 1.1.4. Example 1

Chapter 1 INVESTIGATIONS AND FUNCTIONS 1.1.1 1.1.4 This opening section introduces the students to man of the big ideas of Algebra 2, as well as different was of thinking and various problem solving strategies.

### Online Large-Margin Training of Dependency Parsers

Online Large-Margin Training of Dependency Parsers Ryan McDonald Koby Crammer Fernando Pereira Department of Computer and Information Science University of Pennsylvania Philadelphia, PA {ryantm,crammer,pereira}@cis.upenn.edu

### Active Learning SVM for Blogs recommendation

Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the

### KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS

ABSTRACT KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS In many real applications, RDF (Resource Description Framework) has been widely used as a W3C standard to describe data in the Semantic Web. In practice,

### Semi-Supervised Learning for Blog Classification

Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (2008) Semi-Supervised Learning for Blog Classification Daisuke Ikeda Department of Computational Intelligence and Systems Science,

### Topic and Trend Detection in Text Collections using Latent Dirichlet Allocation

Topic and Trend Detection in Text Collections using Latent Dirichlet Allocation Levent Bolelli 1, Şeyda Ertekin 2, and C. Lee Giles 3 1 Google Inc., 76 9 th Ave., 4 th floor, New York, NY 10011, USA 2

### The Minimum Consistent Subset Cover Problem and its Applications in Data Mining

The Minimum Consistent Subset Cover Problem and its Applications in Data Mining Byron J Gao 1,2, Martin Ester 1, Jin-Yi Cai 2, Oliver Schulte 1, and Hui Xiong 3 1 School of Computing Science, Simon Fraser

### The Graph of a Linear Equation

4.1 The Graph of a Linear Equation 4.1 OBJECTIVES 1. Find three ordered pairs for an equation in two variables 2. Graph a line from three points 3. Graph a line b the intercept method 4. Graph a line that

### Experiments in Web Page Classification for Semantic Web

Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address

### Introducing diversity among the models of multi-label classification ensemble

Introducing diversity among the models of multi-label classification ensemble Lena Chekina, Lior Rokach and Bracha Shapira Ben-Gurion University of the Negev Dept. of Information Systems Engineering and

### BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL

The Fifth International Conference on e-learning (elearning-2014), 22-23 September 2014, Belgrade, Serbia BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL SNJEŽANA MILINKOVIĆ University

### Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment

2009 10th International Conference on Document Analysis and Recognition Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment Ahmad Abdulkader Matthew R. Casey Google Inc. ahmad@abdulkader.org

### Globally Optimal Crowdsourcing Quality Management

Globally Optimal Crowdsourcing Quality Management Akash Das Sarma Stanford University akashds@stanford.edu Aditya G. Parameswaran University of Illinois (UIUC) adityagp@illinois.edu Jennifer Widom Stanford

### New Generation of Software Development

New Generation of Software Development Terry Hon University of British Columbia 201-2366 Main Mall Vancouver B.C. V6T 1Z4 tyehon@cs.ubc.ca ABSTRACT In this paper, I present a picture of what software development

### Three Effective Top-Down Clustering Algorithms for Location Database Systems

Three Effective Top-Down Clustering Algorithms for Location Database Systems Kwang-Jo Lee and Sung-Bong Yang Department of Computer Science, Yonsei University, Seoul, Republic of Korea {kjlee5435, yang}@cs.yonsei.ac.kr

### Predicting borrowers chance of defaulting on credit loans

Predicting borrowers chance of defaulting on credit loans Junjie Liang (junjie87@stanford.edu) Abstract Credit score prediction is of great interests to banks as the outcome of the prediction algorithm

### Optimization of Search Results with Duplicate Page Elimination using Usage Data A. K. Sharma 1, Neelam Duhan 2 1, 2

Optimization of Search Results with Duplicate Page Elimination using Usage Data A. K. Sharma 1, Neelam Duhan 2 1, 2 Department of Computer Engineering, YMCA University of Science & Technology, Faridabad,

### Ensembles and PMML in KNIME

Ensembles and PMML in KNIME Alexander Fillbrunn 1, Iris Adä 1, Thomas R. Gabriel 2 and Michael R. Berthold 1,2 1 Department of Computer and Information Science Universität Konstanz Konstanz, Germany First.Last@Uni-Konstanz.De