# Beyond Myopic Inference in Big Data Pipelines

Size: px
Start display at page:

## Transcription

3 X Z 1 Z 2 Y θ 1 θ 2 θ 3 θ n Figure 5: Schematic of a general pipeline. Going beond the composition of two components, an general pipeline can be viewed as a Directed Acclic Graph, where the nodes are the black box components and the edges denote which outputs are piped as inputs to which components. The basic idea illustrated above for a linear twostep pipeline can be extended to general pipelines such as the one illustrated in Figure 5. An general pipeline structured as a DAG admits a topological ordering. Denoting the input to the pipeline as X, eventual output Y, and the output of each intermediate stage i in this topological ordering as Z i, then an acclic pipeline is a subgraph of the graph shown in Figure 5. From our modeling assumption, this DAG is an equivalent representation of a directed graphical model: each of X, Z 1, Z 2... Y are nodes denoting random variables, and the components enforce conditional independences. The graphical model corresponding to Figure 5 makes the fewest independence assumptions across stages. Analogous to the two stage pipeline from above, the Baes Optimal prediction Baes for general pipelines is Baes = argmax ( z 3 z 1 Pr(z 1 x, θ 1) z 2 Pr(z2 x, z 1, θ 2) ) Pr(z 3 x,z 1,z 2,θ 3)...Pr( x,z 1...z n 1,θ n)... ]] (3) Computing the Baes Optimal Baes is infeasible for realworld pipelines because it requires summing up over all possible intermediate output combinations z 1,... z n 1. We will therefore explore in the following section how this intractable sum can be approximated efficientl using the information that the Top-k API provides. 2.2 Canonical Inference Even the naive model of passing onl the top output to the next stage can be viewed as a crude approximation of the Baes Optimal output. We call this procedure the Canonical Inference algorithm. For simplicit, consider again the two component pipeline, such that the naive model results in z 1 (1) = argmax z 1 Pr(z 1 x, θ 1) Can = argmax Pr( z (1) 1, θ 2) (4) The optimization over in (4) is unaffected if it is multiplied b the term Pr(z 1 (1) x, θ 1). Once the summation in the marginal in (1) is restricted to z 1 (1), the sum reduces to Pr( z 1 (1), θ 2) Pr(z 1 (1) x, θ 1) Pr( x). Hence, this objective provides a lower bound on the true Pr( x). It is eas to see that this holds true for general pipeline architectures too, where the canonical inference procedure proceeds stage-b-stage through the topological ordering. For i = 1,..., n, define: z i (1) = argmax z i Pr(z i x, z 1 (1),..., z i 1 (1), θ i) Can = argmax Pr( x, z (1) 1,..., z (1) n 1, θ n) (5) then Pr( z 1 (1),...,z n 1 (1),θ n)... Pr(z 1 (1) x,θ 1) Pr( x). In this sense, canonical inference is optimizing a crude lower bound to the Baes Optimal objective (again, the terms Pr(z 1 (1) x, θ 1) etc. do not affect the optimization over ). This suggests how inference can be improved, namel b computing improved lower bounds on Pr( x). This is the direction we take in Section 3, where we exploit the Top-k API and explore several was for making the computation of this bound more efficient. 3. EFFICIENT INFERENCE IN PIPELINES Canonical inference and full Baesian inference are two extremes of the efficienc vs. accurac tradeoff. We now explore how to strike a balance between the two when pipeline components expose the Top-k API. In particular, given input x suppose each component θ is ] capable of producing a list of k outputs (1), (2),... (k) along with their conditional ] probabilities Pr( (1) x, θ),... Pr( (k) x, θ) (in decreasing order). We now outline three algorithms, characterize their computational efficienc, and describe their implementation in terms of relational algebra. This interpretation enables direct implementation of these algorithms on top of traditional/probabilistic database sstems. 3.1 Top-k Inference The natural approach to approximating the full Baesian marginal (Eq. 3) when computation is limited to a few samples is to choose the samples as the top-k realizations of the random variable, weighted b their conditional probabilities. We call this Top-k Inference, which proceeds in an analogous fashion to canonical inference, stage-b-stage through the topological ordering. However, inference now becomes more involved. Each component produces a scored list, and each item in the lists generated b predecessors will have a corresponding scored list as output. Denote the list of outputs returned b the component θ i (each of which are possible values for the random variable Z i in the equivalent graphical model) as {z i} k. Note that this set depends on the inputs to θ i, but we omit this dependence from the notation to reduce clutter. This gives rise to the following optimization problem for identifing the eventual output of the pipeline. TopK = argmax {z 3 } k {z 1 } k Pr(z 1 x, θ 1) {z 2 } k Pr(z2 x, z 1, θ 2) ( ) Pr(z 3 x,z 1,z 2,θ 3)...Pr( x,z 1,...z n 1,θ n)... ]] (6) A simple extension of top-k inference is to have a parameter k i for each component θ i that allows variable length lists of outputs {z i} ki for different components in the pipeline. We refer to these parameters collectivel as k. Consider again the simple two stage pipeline in Figure 4. Each component now operates on relational tables and generates relational tables. We initialize the table T 0 with the input to the pipeline x:

4 IX 0 V ALUE 0 SCORE 0 1 x 1 The subscripts in the attributes refers to the component ID, IX denotes the index into the scored list of outputs produced b the component, VALUE is the actual output and SCORE is its conditional score. Now, given a parameter vector k = (k 1, k 2), the first component will generate the table T 1. IX 0 IX 1 V ALUE 1 SCORE z 1 (1) s x,1 (z 1 (1) ) 1 2 z 1 (2) s x,1 (z 1 (2) ) k 1 z 1 (k 1 ) s x,1 (z 1 (k 1 ) ) Note that IX 0 carries no additional information, if component 1 is the onl successor to component 0 (as is the case in this example). However, in a general pipeline, we need columns like IX 0 to uniquel identif inputs to a component. The second component now takes the table T 1 and produces the final table T 2. IX 0 IX 1 IX 2 V ALUE 2 SCORE z 2 (1,1) s z1 (1),2 (z 2 (1,1) )s x,1 (z 1 (1) ) z 2 (1,2) s(z 2 (1,2) )s(z 1 (1) ) k 2 z 2 (1,k 2 ) s(z 2 (1,k 2 ) )s(z 1 (1) ) z 2 (2,1) s(z 2 (2,1) )s(z 1 (2) ) k 1 k 2 z 2 (k 1,k 2 ) s(z 2 (k 1,k 2 ) )s(z 1 (k 1 ) ) The final output for top-k inference TopK is generated b grouping the values T 2.VALUE 2, aggregating b SCORE 2 and selecting the element with largest sum. For linear pipelines (Figure 5 without the curved edge dependencies), this procedure is summarized in Algorithm 1. The linear pipeline has the nice propert that each component has exactl one successor, so we need not store additional information to trace the provenance of each intermediate output. We assume each component provides the interface GenTopK(ID, k, INPUT) which returns a table with attributes VALUE ID, SCORE ID that has k tuples. For the n-stage linear pipeline, performing top-k inference with parameters k will ield a final table T n with at most k 1k 2... k n tuples. There ma be fewer tuples, since an output generated b different combinations of intermediate outputs results in onl one tuple. For top-k inference in general pipelines, the algorithm needs to maintain additional information to uniquel identif each input that a component processes. Specificall, for a component θ i, let the indices of its parents in the DAG be denoted b J. We need to construct a table that θ i can process, which has all compatible combinations of outputs in tables {T j} j J. This procedure is outlined in Algorithm 2. Let the lowest common ancestor of two parents of θ i (sa, θ p and θ q) be denoted b θ m. If θ p and θ q do not share an ancestor, m defaults to 0. The onl compatible output combinations between θ p and θ q are those that were generated from the same output from θ m. We store columns IX in our tables to identif precisel this fact: T i traces all ancestors of the component θ i. Hence, the natural join of tables T p and T q will match on the IX attributes corresponding to Algorithm 1 Top-k Inference for Linear Pipelines. Require: GenT opk(id, k, x), k T 0 CreateTable(V ALUE 0:X ; SCORE 0:1) for i = 1 to n do T i φ for t in T i 1 do x t.v ALUE i 1 s t.score i 1 OP GenT opk(i, k i, x) OP.SCORE i s OP.SCORE i T i T i OP end for if {Successors(i)} 1 then T i Sum(T i.score i, GroupB(V ALUE i)) end if T i T ransform(t i) {Section 3.2, 3.3} end for return Max(T n.score n) Algorithm 2 Creating Compatible Inputs. Require: J 1: p max J 2: T T p 3: A J \ p 4: repeat 5: q max A 6: r LowestCommonAncestor(p, q) 7: T T T q 8: T.SCORE r T.SCORE p Tq.SCOREq T r.score r 9: A A {r} \ {p, q} 10: p r 11: until A 1 12: return T their common ancestors up to m, as shown in lines 6-7 in Algorithm 2. Furthermore, we need to update the probabilit mass for each compatible output combination. From (6), we see that the product of the scores in T p and T q will have the contribution from their common ancestors double-counted. Hence, in line 8 of the algorithm, we divide out this score. With this algorithm, and the understanding that Gen- TopK(ID, k, INPUT) augments its returned table with IX ID information, inference proceeds exactl as the algorithm for the linear pipeline (Algorithm 1) with T i 1 replaced b the output of Algorithm 2 given Parents(i) as input. It should now be clear wh we alwas check the Successors(i) before aggregating tuples in a table: Such an aggregation step loses provenance information (in the form of IX attributes) which is necessar for correct recombination later in the pipeline. 3.2 Fixed Beam Inference The top-k inference procedure described above follows Equation (6) exactl. However, this can result in a multiplicative increase in the sizes of the intermediate tables, with the eventual output requiring aggregation over potentiall k 1k 2... k n outputs. This ma become infeasible for even moderate pipeline sizes and modest choices for k. One wa to control this blowup is to interpret the parameters k as limits on the sizes of the intermediate tables. In particular, at each stage i of the pipeline, we sort table T i

### Client Based Power Iteration Clustering Algorithm to Reduce Dimensionality in Big Data

Client Based Power Iteration Clustering Algorithm to Reduce Dimensionalit in Big Data Jaalatchum. D 1, Thambidurai. P 1, Department of CSE, PKIET, Karaikal, India Abstract - Clustering is a group of objects

### Natural Language to Relational Query by Using Parsing Compiler

Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,

### K-Means Cluster Analysis. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1

K-Means Cluster Analsis Chapter 3 PPDM Class Tan,Steinbach, Kumar Introduction to Data Mining 4/18/4 1 What is Cluster Analsis? Finding groups of objects such that the objects in a group will be similar

### Example: Document Clustering. Clustering: Definition. Notion of a Cluster can be Ambiguous. Types of Clusterings. Hierarchical Clustering

Overview Prognostic Models and Data Mining in Medicine, part I Cluster Analsis What is Cluster Analsis? K-Means Clustering Hierarchical Clustering Cluster Validit Eample: Microarra data analsis 6 Summar

### Guest Editors Introduction: Machine Learning in Speech and Language Technologies

Guest Editors Introduction: Machine Learning in Speech and Language Technologies Pascale Fung (pascale@ee.ust.hk) Department of Electrical and Electronic Engineering Hong Kong University of Science and

### Distributed Structured Prediction for Big Data

Distributed Structured Prediction for Big Data A. G. Schwing ETH Zurich aschwing@inf.ethz.ch T. Hazan TTI Chicago M. Pollefeys ETH Zurich R. Urtasun TTI Chicago Abstract The biggest limitations of learning

### Building a Question Classifier for a TREC-Style Question Answering System

Building a Question Classifier for a TREC-Style Question Answering System Richard May & Ari Steinberg Topic: Question Classification We define Question Classification (QC) here to be the task that, given

### Towards SoMEST Combining Social Media Monitoring with Event Extraction and Timeline Analysis

Towards SoMEST Combining Social Media Monitoring with Event Extraction and Timeline Analysis Yue Dai, Ernest Arendarenko, Tuomo Kakkonen, Ding Liao School of Computing University of Eastern Finland {yvedai,

### Chapter 8. Final Results on Dutch Senseval-2 Test Data

Chapter 8 Final Results on Dutch Senseval-2 Test Data The general idea of testing is to assess how well a given model works and that can only be done properly on data that has not been seen before. Supervised

### Learning and Inference over Constrained Output

IJCAI 05 Learning and Inference over Constrained Output Vasin Punyakanok Dan Roth Wen-tau Yih Dav Zimak Department of Computer Science University of Illinois at Urbana-Champaign {punyakan, danr, yih, davzimak}@uiuc.edu

### Open Domain Information Extraction. Günter Neumann, DFKI, 2012

Open Domain Information Extraction Günter Neumann, DFKI, 2012 Improving TextRunner Wu and Weld (2010) Open Information Extraction using Wikipedia, ACL 2010 Fader et al. (2011) Identifying Relations for

### Data Mining Cluster Analysis: Basic Concepts and Algorithms. Clustering Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining b Tan, Steinbach, Kumar Clustering Algorithms K-means and its variants Hierarchical clustering

### Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining b Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining /8/ What is Cluster

### Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov

Search and Data Mining: Techniques Text Mining Anya Yarygina Boris Novikov Introduction Generally used to denote any system that analyzes large quantities of natural language text and detects lexical or

### Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining b Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/8/4 What is

### An NLP Curator (or: How I Learned to Stop Worrying and Love NLP Pipelines)

An NLP Curator (or: How I Learned to Stop Worrying and Love NLP Pipelines) James Clarke, Vivek Srikumar, Mark Sammons, Dan Roth Department of Computer Science, University of Illinois, Urbana-Champaign.

### Shallow Parsing with Apache UIMA

Shallow Parsing with Apache UIMA Graham Wilcock University of Helsinki Finland graham.wilcock@helsinki.fi Abstract Apache UIMA (Unstructured Information Management Architecture) is a framework for linguistic

### Learning from Diversity

Learning from Diversity Epitope Prediction with Sequence and Structure Features using an Ensemble of Support Vector Machines Rob Patro and Carl Kingsford Center for Bioinformatics and Computational Biology

### Course: Model, Learning, and Inference: Lecture 5

Course: Model, Learning, and Inference: Lecture 5 Alan Yuille Department of Statistics, UCLA Los Angeles, CA 90095 yuille@stat.ucla.edu Abstract Probability distributions on structured representation.

### CENG 734 Advanced Topics in Bioinformatics

CENG 734 Advanced Topics in Bioinformatics Week 9 Text Mining for Bioinformatics: BioCreative II.5 Fall 2010-2011 Quiz #7 1. Draw the decompressed graph for the following graph summary 2. Describe the

### DEPENDENCY PARSING JOAKIM NIVRE

DEPENDENCY PARSING JOAKIM NIVRE Contents 1. Dependency Trees 1 2. Arc-Factored Models 3 3. Online Learning 3 4. Eisner s Algorithm 4 5. Spanning Tree Parsing 6 References 7 A dependency parser analyzes

### Distributed forests for MapReduce-based machine learning

Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication

### POSBIOTM-NER: A Machine Learning Approach for. Bio-Named Entity Recognition

POSBIOTM-NER: A Machine Learning Approach for Bio-Named Entity Recognition Yu Song, Eunji Yi, Eunju Kim, Gary Geunbae Lee, Department of CSE, POSTECH, Pohang, Korea 790-784 Soo-Jun Park Bioinformatics

### KEYWORD SEARCH IN RELATIONAL DATABASES

KEYWORD SEARCH IN RELATIONAL DATABASES N.Divya Bharathi 1 1 PG Scholar, Department of Computer Science and Engineering, ABSTRACT Adhiyamaan College of Engineering, Hosur, (India). Data mining refers to

### II. RELATED WORK. Sentiment Mining

Sentiment Mining Using Ensemble Classification Models Matthew Whitehead and Larry Yaeger Indiana University School of Informatics 901 E. 10th St. Bloomington, IN 47408 {mewhiteh, larryy}@indiana.edu Abstract

### Making Sense of the Mayhem: Machine Learning and March Madness

Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University atran3@stanford.edu ginzberg@stanford.edu I. Introduction III. Model The goal of our research

### A Systematic Cross-Comparison of Sequence Classifiers

A Systematic Cross-Comparison of Sequence Classifiers Binyamin Rozenfeld, Ronen Feldman, Moshe Fresko Bar-Ilan University, Computer Science Department, Israel grurgrur@gmail.com, feldman@cs.biu.ac.il,

### Compact Representations and Approximations for Compuation in Games

Compact Representations and Approximations for Compuation in Games Kevin Swersky April 23, 2008 Abstract Compact representations have recently been developed as a way of both encoding the strategic interactions

### SVM Based Learning System For Information Extraction

SVM Based Learning System For Information Extraction Yaoyong Li, Kalina Bontcheva, and Hamish Cunningham Department of Computer Science, The University of Sheffield, Sheffield, S1 4DP, UK {yaoyong,kalina,hamish}@dcs.shef.ac.uk

### Adaptive Tolerance Algorithm for Distributed Top-K Monitoring with Bandwidth Constraints

Adaptive Tolerance Algorithm for Distributed Top-K Monitoring with Bandwidth Constraints Michael Bauer, Srinivasan Ravichandran University of Wisconsin-Madison Department of Computer Sciences {bauer, srini}@cs.wisc.edu

### Categorical Data Visualization and Clustering Using Subjective Factors

Categorical Data Visualization and Clustering Using Subjective Factors Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,

### Research Statement Immanuel Trummer www.itrummer.org

Research Statement Immanuel Trummer www.itrummer.org We are collecting data at unprecedented rates. This data contains valuable insights, but we need complex analytics to extract them. My research focuses

### Data Mining Clustering. Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

Data Mining Clustering Toon Calders Sheets are based on the those provided b Tan, Steinbach, and Kumar. Introduction to Data Mining What is Cluster Analsis? Finding groups of objects such that the objects

### A Knowledge-Poor Approach to BioCreative V DNER and CID Tasks

A Knowledge-Poor Approach to BioCreative V DNER and CID Tasks Firoj Alam 1, Anna Corazza 2, Alberto Lavelli 3, and Roberto Zanoli 3 1 Dept. of Information Eng. and Computer Science, University of Trento,

### Università degli Studi di Bologna

Università degli Studi di Bologna DEIS Biometric System Laboratory Incremental Learning by Message Passing in Hierarchical Temporal Memory Davide Maltoni Biometric System Laboratory DEIS - University of

### Steven C.H. Hoi School of Information Systems Singapore Management University Email: chhoi@smu.edu.sg

Steven C.H. Hoi School of Information Systems Singapore Management University Email: chhoi@smu.edu.sg Introduction http://stevenhoi.org/ Finance Recommender Systems Cyber Security Machine Learning Visual

### Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

### Social Media Mining. Data Mining Essentials

Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

### SERVO CONTROL SYSTEMS 1: DC Servomechanisms

Servo Control Sstems : DC Servomechanisms SERVO CONTROL SYSTEMS : DC Servomechanisms Elke Laubwald: Visiting Consultant, control sstems principles.co.uk ABSTRACT: This is one of a series of white papers

### A Framework-based Online Question Answering System. Oliver Scheuer, Dan Shen, Dietrich Klakow

A Framework-based Online Question Answering System Oliver Scheuer, Dan Shen, Dietrich Klakow Outline General Structure for Online QA System Problems in General Structure Framework-based Online QA system

### Branch-and-Price Approach to the Vehicle Routing Problem with Time Windows

TECHNISCHE UNIVERSITEIT EINDHOVEN Branch-and-Price Approach to the Vehicle Routing Problem with Time Windows Lloyd A. Fasting May 2014 Supervisors: dr. M. Firat dr.ir. M.A.A. Boon J. van Twist MSc. Contents

### English Grammar Checker

International l Journal of Computer Sciences and Engineering Open Access Review Paper Volume-4, Issue-3 E-ISSN: 2347-2693 English Grammar Checker Pratik Ghosalkar 1*, Sarvesh Malagi 2, Vatsal Nagda 3,

### Semantic parsing with Structured SVM Ensemble Classification Models

Semantic parsing with Structured SVM Ensemble Classification Models Le-Minh Nguyen, Akira Shimazu, and Xuan-Hieu Phan Japan Advanced Institute of Science and Technology (JAIST) Asahidai 1-1, Nomi, Ishikawa,

### A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier

A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier G.T. Prasanna Kumari Associate Professor, Dept of Computer Science and Engineering, Gokula Krishna College of Engg, Sullurpet-524121,

### SIPAC. Signals and Data Identification, Processing, Analysis, and Classification

SIPAC Signals and Data Identification, Processing, Analysis, and Classification Framework for Mass Data Processing with Modules for Data Storage, Production and Configuration SIPAC key features SIPAC is

### An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

### Statistical Machine Translation: IBM Models 1 and 2

Statistical Machine Translation: IBM Models 1 and 2 Michael Collins 1 Introduction The next few lectures of the course will be focused on machine translation, and in particular on statistical machine translation

### Online Active Learning Methods for Fast Label-Efficient Spam Filtering

Online Active Learning Methods for Fast Label-Efficient Spam Filtering D. Sculley Department of Computer Science Tufts University, Medford, MA USA dsculley@cs.tufts.edu ABSTRACT Active learning methods

### large-scale machine learning revisited Léon Bottou Microsoft Research (NYC)

large-scale machine learning revisited Léon Bottou Microsoft Research (NYC) 1 three frequent ideas in machine learning. independent and identically distributed data This experimental paradigm has driven

### SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY

SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY G.Evangelin Jenifer #1, Mrs.J.Jaya Sherin *2 # PG Scholar, Department of Electronics and Communication Engineering(Communication and Networking), CSI Institute

### Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

### FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT

### The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2

2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2 1 School of

### Tekniker för storskalig parsning

Tekniker för storskalig parsning Diskriminativa modeller Joakim Nivre Uppsala Universitet Institutionen för lingvistik och filologi joakim.nivre@lingfil.uu.se Tekniker för storskalig parsning 1(19) Generative

### INVESTIGATIONS AND FUNCTIONS 1.1.1 1.1.4. Example 1

Chapter 1 INVESTIGATIONS AND FUNCTIONS 1.1.1 1.1.4 This opening section introduces the students to man of the big ideas of Algebra 2, as well as different was of thinking and various problem solving strategies.

### 6.2.8 Neural networks for data mining

6.2.8 Neural networks for data mining Walter Kosters 1 In many application areas neural networks are known to be valuable tools. This also holds for data mining. In this chapter we discuss the use of neural

### Parallel Data Mining. Team 2 Flash Coders Team Research Investigation Presentation 2. Foundations of Parallel Computing Oct 2014

Parallel Data Mining Team 2 Flash Coders Team Research Investigation Presentation 2 Foundations of Parallel Computing Oct 2014 Agenda Overview of topic Analysis of research papers Software design Overview

### Knowledge Discovery from patents using KMX Text Analytics

Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs anton.heijs@treparel.com Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers

### Monotonicity Hints. Abstract

Monotonicity Hints Joseph Sill Computation and Neural Systems program California Institute of Technology email: joe@cs.caltech.edu Yaser S. Abu-Mostafa EE and CS Deptartments California Institute of Technology

### Structured Models for Fine-to-Coarse Sentiment Analysis

Structured Models for Fine-to-Coarse Sentiment Analysis Ryan McDonald Kerry Hannan Tyler Neylon Mike Wells Jeff Reynar Google, Inc. 76 Ninth Avenue New York, NY 10011 Contact email: ryanmcd@google.com

### Predicting the Stock Market with News Articles

Predicting the Stock Market with News Articles Kari Lee and Ryan Timmons CS224N Final Project Introduction Stock market prediction is an area of extreme importance to an entire industry. Stock price is

### BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL

The Fifth International Conference on e-learning (elearning-2014), 22-23 September 2014, Belgrade, Serbia BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL SNJEŽANA MILINKOVIĆ University

### Search and Information Retrieval

Search and Information Retrieval Search on the Web 1 is a daily activity for many people throughout the world Search and communication are most popular uses of the computer Applications involving search

### An Energy Efficient Location Service for Mobile Ad Hoc Networks

An Energ Efficient Location Service for Mobile Ad Hoc Networks Zijian Wang 1, Euphan Bulut 1 and Boleslaw K. Szmanski 1, 1 Department of Computer Science, Rensselaer Poltechnic Institute, Tro, NY 12180

### Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

### Procedia Computer Science 00 (2012) 1 21. Trieu Minh Nhut Le, Jinli Cao, and Zhen He. trieule@sgu.edu.vn, j.cao@latrobe.edu.au, z.he@latrobe.edu.

Procedia Computer Science 00 (2012) 1 21 Procedia Computer Science Top-k best probability queries and semantics ranking properties on probabilistic databases Trieu Minh Nhut Le, Jinli Cao, and Zhen He

### SYSM 6304: Risk and Decision Analysis Lecture 5: Methods of Risk Analysis

SYSM 6304: Risk and Decision Analysis Lecture 5: Methods of Risk Analysis M. Vidyasagar Cecil & Ida Green Chair The University of Texas at Dallas Email: M.Vidyasagar@utdallas.edu October 17, 2015 Outline

### An Introduction to Data Mining

An Introduction to Intel Beijing wei.heng@intel.com January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail

### Predict Influencers in the Social Network

Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

### Globally Optimal Crowdsourcing Quality Management

Globally Optimal Crowdsourcing Quality Management Akash Das Sarma Stanford University akashds@stanford.edu Aditya G. Parameswaran University of Illinois (UIUC) adityagp@illinois.edu Jennifer Widom Stanford

### Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be

### Active Learning SVM for Blogs recommendation

Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the

### Graph Mining and Social Network Analysis

Graph Mining and Social Network Analysis Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann

### Make Better Decisions with Optimization

ABSTRACT Paper SAS1785-2015 Make Better Decisions with Optimization David R. Duling, SAS Institute Inc. Automated decision making systems are now found everywhere, from your bank to your government to

### Moral Hazard. Itay Goldstein. Wharton School, University of Pennsylvania

Moral Hazard Itay Goldstein Wharton School, University of Pennsylvania 1 Principal-Agent Problem Basic problem in corporate finance: separation of ownership and control: o The owners of the firm are typically

### The Minimum Consistent Subset Cover Problem and its Applications in Data Mining

The Minimum Consistent Subset Cover Problem and its Applications in Data Mining Byron J Gao 1,2, Martin Ester 1, Jin-Yi Cai 2, Oliver Schulte 1, and Hui Xiong 3 1 School of Computing Science, Simon Fraser

### Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk

Introduction to Machine Learning and Data Mining Prof. Dr. Igor Trakovski trakovski@nyus.edu.mk Neural Networks 2 Neural Networks Analogy to biological neural systems, the most robust learning systems

### Three Effective Top-Down Clustering Algorithms for Location Database Systems

Three Effective Top-Down Clustering Algorithms for Location Database Systems Kwang-Jo Lee and Sung-Bong Yang Department of Computer Science, Yonsei University, Seoul, Republic of Korea {kjlee5435, yang}@cs.yonsei.ac.kr

### Semi-Supervised Learning for Blog Classification

Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (2008) Semi-Supervised Learning for Blog Classification Daisuke Ikeda Department of Computational Intelligence and Systems Science,

### Blog Post Extraction Using Title Finding

Blog Post Extraction Using Title Finding Linhai Song 1, 2, Xueqi Cheng 1, Yan Guo 1, Bo Wu 1, 2, Yu Wang 1, 2 1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing 2 Graduate School

### Machine Learning Final Project Spam Email Filtering

Machine Learning Final Project Spam Email Filtering March 2013 Shahar Yifrah Guy Lev Table of Content 1. OVERVIEW... 3 2. DATASET... 3 2.1 SOURCE... 3 2.2 CREATION OF TRAINING AND TEST SETS... 4 2.3 FEATURE

### The Graph of a Linear Equation

4.1 The Graph of a Linear Equation 4.1 OBJECTIVES 1. Find three ordered pairs for an equation in two variables 2. Graph a line from three points 3. Graph a line b the intercept method 4. Graph a line that

### Optimization of Search Results with Duplicate Page Elimination using Usage Data A. K. Sharma 1, Neelam Duhan 2 1, 2

Optimization of Search Results with Duplicate Page Elimination using Usage Data A. K. Sharma 1, Neelam Duhan 2 1, 2 Department of Computer Engineering, YMCA University of Science & Technology, Faridabad,

### Data Mining Analytics for Business Intelligence and Decision Support

Data Mining Analytics for Business Intelligence and Decision Support Chid Apte, T.J. Watson Research Center, IBM Research Division Knowledge Discovery and Data Mining (KDD) techniques are used for analyzing

### A Serial Partitioning Approach to Scaling Graph-Based Knowledge Discovery

A Serial Partitioning Approach to Scaling Graph-Based Knowledge Discovery Runu Rathi, Diane J. Cook, Lawrence B. Holder Department of Computer Science and Engineering The University of Texas at Arlington

### Data Mining and Database Systems: Where is the Intersection?

Data Mining and Database Systems: Where is the Intersection? Surajit Chaudhuri Microsoft Research Email: surajitc@microsoft.com 1 Introduction The promise of decision support systems is to exploit enterprise

### 131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

1/10 131-1 Adding New Level in KDD to Make the Web Usage Mining More Efficient Mohammad Ala a AL_Hamami PHD Student, Lecturer m_ah_1@yahoocom Soukaena Hassan Hashem PHD Student, Lecturer soukaena_hassan@yahoocom

### Introducing diversity among the models of multi-label classification ensemble

Introducing diversity among the models of multi-label classification ensemble Lena Chekina, Lior Rokach and Bracha Shapira Ben-Gurion University of the Negev Dept. of Information Systems Engineering and

### Ensembles and PMML in KNIME

Ensembles and PMML in KNIME Alexander Fillbrunn 1, Iris Adä 1, Thomas R. Gabriel 2 and Michael R. Berthold 1,2 1 Department of Computer and Information Science Universität Konstanz Konstanz, Germany First.Last@Uni-Konstanz.De

### Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment

2009 10th International Conference on Document Analysis and Recognition Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment Ahmad Abdulkader Matthew R. Casey Google Inc. ahmad@abdulkader.org

### End-to-End Sentiment Analysis of Twitter Data

End-to-End Sentiment Analysis of Twitter Data Apoor v Agarwal 1 Jasneet Singh Sabharwal 2 (1) Columbia University, NY, U.S.A. (2) Guru Gobind Singh Indraprastha University, New Delhi, India apoorv@cs.columbia.edu,

### New Generation of Software Development

New Generation of Software Development Terry Hon University of British Columbia 201-2366 Main Mall Vancouver B.C. V6T 1Z4 tyehon@cs.ubc.ca ABSTRACT In this paper, I present a picture of what software development

### For supervised classification we have a variety of measures to evaluate how good our model is Accuracy, precision, recall

Cluster Validation Cluster Validit For supervised classification we have a variet of measures to evaluate how good our model is Accurac, precision, recall For cluster analsis, the analogous question is

### A Business Process Services Portal

A Business Process Services Portal IBM Research Report RZ 3782 Cédric Favre 1, Zohar Feldman 3, Beat Gfeller 1, Thomas Gschwind 1, Jana Koehler 1, Jochen M. Küster 1, Oleksandr Maistrenko 1, Alexandru

### KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS

ABSTRACT KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS In many real applications, RDF (Resource Description Framework) has been widely used as a W3C standard to describe data in the Semantic Web. In practice,

### Chapter 6. The stacking ensemble approach

82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

### Autonomous Equations / Stability of Equilibrium Solutions. y = f (y).

Autonomous Equations / Stabilit of Equilibrium Solutions First order autonomous equations, Equilibrium solutions, Stabilit, Longterm behavior of solutions, direction fields, Population dnamics and logistic

### Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification