Codewebs: Scalable Homework Search for Massive Open Online Programming Courses

Size: px
Start display at page:

Download "Codewebs: Scalable Homework Search for Massive Open Online Programming Courses"

Transcription

1 Codewebs: Scalable Homework Search for Massive Open Online Programming Courses Leonidas Guibas Andy Nguyen Chris Piech Jonathan Huang Educause, 11/6/2013

2 Massive Scale Education 4.5 million users 750,000 users 1.2 million users

3 Complex and informative feedback in MOOCs Multiple choice Coding assignments Proofs Essay questions Short Response Easy to automate Limited ability to ask expressive questions or require creativity Hard to grade automatically Long Response Can assign complex assignments and provide complex feedback 3

4 Binary feedback for coding questions Linear Regression submission (Homework 1) for Coursera s ML class Test Inputs function [theta, J_history] = gradientdescent(x, y, theta, alpha, num_iters) %GRADIENTDESCENT Performs gradient descent to learn theta % theta = GRADIENTDESCENT(X, y, theta, alpha, num_iters) updates theta by % taking num_iters gradient steps with learning rate alpha What would a human grader m = length(y); % number of training examples J_history = zeros(num_iters, 1); (TA) do? for iter = 1:num_iters theta = theta-alpha*1/m*(x'*(x*theta-y)); J_history(iter) = computecost(x, y, theta); end Test Outputs Correct / Incorrect? 4

5 Complex, informative feedback function [theta, J_history] = gradientdescent(x, y, theta, alpha, num_iters) m = length(y); J_history = zeros(num_iters, 1); for iter = 1:num_iters hypo = X*theta; newmat = hypo y; trans1 = (X(:,1)); trans1 = trans1 ; newmat1 = trans1 * newmat; temp1 = sum(newmat1); temp1 = (temp1 *alpha)/m; A = [temp1]; theta(1) = theta(1) - A; trans2 = (X(:,2)) ; newmat2 = trans2*newmat; temp2 = sum(newmat2); temp2 = (temp2 *alpha)/m; B = [temp2]; theta(2)= theta(2) - B; J_history(iter) = computecost(x, y, theta); end theta(1) = theta(1) theta(2)= theta(2); Better: theta = theta-(alpha/m) Why?? *X'*(X*theta-y) Correctness Efficiency Style Elegance Good Good Poor Poor 5

6 Technical Challenges sum(x'*(x*theta-y)); sum(((theta *X ) -y) *X); sum(transpose(x*theta-y)*x); Source code similarity Scalability

7 The Data Code Web: Improving Life for Future Tas John, Andy, Chris, Leo

8 Coursera Machine Learning > 1 million submissions

9 Why Care? Code Web: Improving Life for Future Tas John, Andy, Chris, Leo

10 The feedback paradox Fast feedback is key for student success [Dihoff 04] Human feedback is an overwhelming time commitment for teachers [Sadler 06].

11 Cost disease

12 Grand Challenge Automatically provide feedback to students (for programming assignments)

13 Basic Idea Code Web: Improving Life for Future Tas John, Andy, Chris, Leo

14 [Sad Teacher] Input: many ungraded submissions for a programming assignment

15 [Happy Teacher] Output: annotated submissions

16 [Effective [Happy Teacher] Force multiply teacher effort.

17 Build A Homework Search Engine Code Web: Improving Life for Future Tas John, Andy, Chris, Leo

18 Abstract syntax tree representations function A = warmupexercise() A = []; A = eye(5); endfunction ASSIGN IDENT (A) INDEX_EXP ASTs ignore: Whitespace Comments IDENT (eye) ARGUMENT_LIST CONST (5)

19 Abstract syntax tree representations function A = warmupexercise() A = []; A = eye(5); endfunction ASTs ignore: Whitespace Comments

20 Indexing documents by phrases blue sky yellow submarine The bright and blue butterfly hangs on the breeze term/phrase best {1,3} blue {2,4,6} document list What basic queries should code We all something something yellow submarine search engine bright support? {7,8,10,11,12} heat {1,5,13} kernel {2,5,6,9,56} sky {1,2} submarine {2,3,4} woes {10,19,38} yellow {2,4}

21 Find the Feet That Fit Query ASTs that match on context

22 Find the Feet That Fit Query ASTs that match on context

23 Runtime (seconds) Runtime (seconds) Running time for indexing # ASTs indexed Average AST size (# nodes)

24 Subtree frequency Zipf s Law for code phrases starter code elbow Subtree rank

25 Great!

26 So what?

27 Data driven equivalence classes of code Code phrases Web: Improving Life for Future Tas John, Andy, Chris, Leo

28 There is Structure to Code

29 Exponential Solution Space def dosomething() { Part A Part B Part C 500 ways 200 ways 200 ways Without factoring: params = With factoring: params = }

30 Say you have two sub-forests Sub-forest Solution A Solution B

31 Each sub-forest has a complement Solution A Solution B

32 If those complements are equal Solution A Solution B

33 And the programs have the same output Solution A Solution B

34 The sub-forests were analogous in that context Code Block A Code Block B

35 {m} m rows (X) rows (y) size (X, 1) length (y) size (y, 1) length (x (:, 1)) length (X) size (X) (1) {alphaoverm} alpha / {m} 1 / {m} * alpha alpha.* (1 / {m}) alpha./ {m} alpha * inv ({m}) alpha * (1./ {m}) 1 * alpha / {m} alpha * pinv ({m}) alpha * 1./ {m} alpha.* 1 / {m} 1.* alpha./ {m} alpha * (1 / {m}).01 / {m} alpha.* (1./ {m}) alpha * {m} ^ -1 {hypothesis} (X * theta) (theta' * X')' [X] * theta (X * theta (:)) theta(1) + theta (2) * X (:, 2) sum(x.*repmat(theta',{m},1), 2) {residual} (X * theta - y) (theta' * X' - y')' ({hypothesis} - y) ({hypothesis}' - y )' [{hypothesis} - y] sum({hypothesis} - y, 2)

36 % remaining ASTs after canonicalization Reduction in #ASTs via canonicalization equivalence More class equivalence classes 90 means more reduction 80 means fewer ASTs We get better 10 results on the more common 0 ASTs # unique ASTs considered (Ordered by frequency of AST)

37 # submissions covered (out of 40,000) How many submissions can we give feedback to with fixed effort? with 25 ASTs marked with 200 ASTs marked # equivalence classes

38 # submissions covered (out of 40,000) How many submissions can we give feedback to with fixed effort? with 25 ASTs marked with 200 ASTs marked Equivalences + 25 marked No Equivalences marked # equivalence classes

39 Syntactic bug isolation using Codewebs Web: Improving Life for Future Tas John, Andy, Chris, Leo

40 Recall Bug detection evaluation (Difficult to evaluate localization!!) Each point represents a single coding problem in Coursera s ML class regularized backpropagation for neural nets Precision

41 Higher is better F-score Bug detection evaluation with canonicalization without canonicalization # unique ASTs considered

42 Concrete Example Code Web: Improving Life for Future Tas John, Andy, Chris, Leo

43 Case study: The extraneous sum function [theta, J_history] = gradientdescent(x, y, theta, alpha, num_iters) %GRADIENTDESCENT Performs gradient descent to learn theta % theta = GRADIENTDESCENT(X, y, theta, alpha, num_iters) updates theta by % taking num_iters gradient steps with learning rate alpha Syntax based approach: m = length(y); % number of training examples J_history = zeros(num_iters, 1); Attach this message to everyone containing that exact expression for iter = 1:num_iters theta = theta-alpha*1/m*(x'*(x*theta-y)); J_history(iter) (covers = computecost(x, 99 submissions) y, theta); end function [theta, J_history] = gradientdescent(x, y, theta, alpha, num_iters) %GRADIENTDESCENT Performs gradient descent Dear to learn Lisa Simpson, theta consider the % theta = GRADIENTDESCENT(X, y, theta, dimension alpha, num_iters) of the expression: updates theta by % taking num_iters gradient steps with learning rate alpha m = length(y); % number of training examples J_history = zeros(num_iters, 1); Correct X'*(X*theta-y) and what happens after you call sum on it for iter = 1:num_iters theta = theta-alpha*1/m*sum(x'*(x*theta-y)); J_history(iter) = computecost(x, y, theta); end Incorrect

44 The extraneous sum bug takes many forms theta = theta-alpha*1/m*sum(x'*(x*theta-y)); theta = theta-alpha*1/m*sum(((theta *X ) -y) *X); theta = theta-alpha*1/m*sum(transpose(x*theta-y)*x); (Easier) Output based approach: Attach this message to everyone containing that exact expression (covers 1091 submissions)

45 Codewebs approach to feedback theta = theta-alpha*1/m*sum(x'*(x*theta-y)); Step 1: Find equivalent ways of writing buggy expression using Codewebs engine Step 2: Write a thoughtful/meaningful hint or explanation ~47% improvement over Step 3: Propagate just feedback using an message output based to any submission containing feedback equivalent system!! expression 1091 Output based 1208 Codewebs 1604 Combined # submissions covered by single message

46 Big data as a problem: can we give human quality feedback to a million code submissions? Big data as a solution: structure (clustering, subtree equivalence) of the solution space invisible without dense sampling! Where we re going next: Extensions to other languages Applications to new problem domains Data association problem for variables Dynamic analysis Temporal analysis

47 How do solutions evolve modeling the human creative process Feedback on progress, not just on final solution Think beyond computer science, beyond education

Machine Learning. CUNY Graduate Center, Spring 2013. Professor Liang Huang. [email protected]

Machine Learning. CUNY Graduate Center, Spring 2013. Professor Liang Huang. huang@cs.qc.cuny.edu Machine Learning CUNY Graduate Center, Spring 2013 Professor Liang Huang [email protected] http://acl.cs.qc.edu/~lhuang/teaching/machine-learning Logistics Lectures M 9:30-11:30 am Room 4419 Personnel

More information

Programming Exercise 3: Multi-class Classification and Neural Networks

Programming Exercise 3: Multi-class Classification and Neural Networks Programming Exercise 3: Multi-class Classification and Neural Networks Machine Learning November 4, 2011 Introduction In this exercise, you will implement one-vs-all logistic regression and neural networks

More information

Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

More information

Making Sense of the Mayhem: Machine Learning and March Madness

Making Sense of the Mayhem: Machine Learning and March Madness Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University [email protected] [email protected] I. Introduction III. Model The goal of our research

More information

Introduction to Logistic Regression

Introduction to Logistic Regression OpenStax-CNX module: m42090 1 Introduction to Logistic Regression Dan Calderon This work is produced by OpenStax-CNX and licensed under the Creative Commons Attribution License 3.0 Abstract Gives introduction

More information

Journée Thématique Big Data 13/03/2015

Journée Thématique Big Data 13/03/2015 Journée Thématique Big Data 13/03/2015 1 Agenda About Flaminem What Do We Want To Predict? What Is The Machine Learning Theory Behind It? How Does It Work In Practice? What Is Happening When Data Gets

More information

CS/COE 1501 http://cs.pitt.edu/~bill/1501/

CS/COE 1501 http://cs.pitt.edu/~bill/1501/ CS/COE 1501 http://cs.pitt.edu/~bill/1501/ Lecture 01 Course Introduction Meta-notes These notes are intended for use by students in CS1501 at the University of Pittsburgh. They are provided free of charge

More information

COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Big Data by the numbers

COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Big Data by the numbers COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Instructor: ([email protected]) TAs: Pierre-Luc Bacon ([email protected]) Ryan Lowe ([email protected])

More information

Knowledge Discovery from patents using KMX Text Analytics

Knowledge Discovery from patents using KMX Text Analytics Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs [email protected] Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers

More information

Machine Learning and Data Mining. Regression Problem. (adapted from) Prof. Alexander Ihler

Machine Learning and Data Mining. Regression Problem. (adapted from) Prof. Alexander Ihler Machine Learning and Data Mining Regression Problem (adapted from) Prof. Alexander Ihler Overview Regression Problem Definition and define parameters ϴ. Prediction using ϴ as parameters Measure the error

More information

14.10.2014. Overview. Swarms in nature. Fish, birds, ants, termites, Introduction to swarm intelligence principles Particle Swarm Optimization (PSO)

14.10.2014. Overview. Swarms in nature. Fish, birds, ants, termites, Introduction to swarm intelligence principles Particle Swarm Optimization (PSO) Overview Kyrre Glette kyrrehg@ifi INF3490 Swarm Intelligence Particle Swarm Optimization Introduction to swarm intelligence principles Particle Swarm Optimization (PSO) 3 Swarms in nature Fish, birds,

More information

Designing Programming Exercises with Computer Assisted Instruction *

Designing Programming Exercises with Computer Assisted Instruction * Designing Programming Exercises with Computer Assisted Instruction * Fu Lee Wang 1, and Tak-Lam Wong 2 1 Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong [email protected]

More information

Introduction to Machine Learning Using Python. Vikram Kamath

Introduction to Machine Learning Using Python. Vikram Kamath Introduction to Machine Learning Using Python Vikram Kamath Contents: 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. Introduction/Definition Where and Why ML is used Types of Learning Supervised Learning Linear Regression

More information

Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence

Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence Artificial Neural Networks and Support Vector Machines CS 486/686: Introduction to Artificial Intelligence 1 Outline What is a Neural Network? - Perceptron learners - Multi-layer networks What is a Support

More information

MasteryTrack: System Overview

MasteryTrack: System Overview MasteryTrack: System Overview March 16, 2015 Table of Contents Background Objectives Foundational Principles Mastery is Binary Demonstrating Mastery Scope of Activity Level of Accuracy Time System Elements

More information

large-scale machine learning revisited Léon Bottou Microsoft Research (NYC)

large-scale machine learning revisited Léon Bottou Microsoft Research (NYC) large-scale machine learning revisited Léon Bottou Microsoft Research (NYC) 1 three frequent ideas in machine learning. independent and identically distributed data This experimental paradigm has driven

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising

More information

Mining a Change-Based Software Repository

Mining a Change-Based Software Repository Mining a Change-Based Software Repository Romain Robbes Faculty of Informatics University of Lugano, Switzerland 1 Introduction The nature of information found in software repositories determines what

More information

COS 116 The Computational Universe Laboratory 9: Virus and Worm Propagation in Networks

COS 116 The Computational Universe Laboratory 9: Virus and Worm Propagation in Networks COS 116 The Computational Universe Laboratory 9: Virus and Worm Propagation in Networks You learned in lecture about computer viruses and worms. In this lab you will study virus propagation at the quantitative

More information

Data Mining Practical Machine Learning Tools and Techniques

Data Mining Practical Machine Learning Tools and Techniques Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

More information

Machine Learning over Big Data

Machine Learning over Big Data Machine Learning over Big Presented by Fuhao Zou [email protected] Jue 16, 2014 Huazhong University of Science and Technology Contents 1 2 3 4 Role of Machine learning Challenge of Big Analysis Distributed

More information

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines

More information

PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design

PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions Slide 1 Outline Principles for performance oriented design Performance testing Performance tuning General

More information

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University [email protected]

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University caizhua@gmail.com Bayesian Machine Learning (ML): Modeling And Inference in Big Data Zhuhua Cai Google Rice University [email protected] 1 Syllabus Bayesian ML Concepts (Today) Bayesian ML on MapReduce (Next morning) Bayesian

More information

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM.

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM. DATA MINING TECHNOLOGY Georgiana Marin 1 Abstract In terms of data processing, classical statistical models are restrictive; it requires hypotheses, the knowledge and experience of specialists, equations,

More information

DYNAMIC QUERY FORMS WITH NoSQL

DYNAMIC QUERY FORMS WITH NoSQL IMPACT: International Journal of Research in Engineering & Technology (IMPACT: IJRET) ISSN(E): 2321-8843; ISSN(P): 2347-4599 Vol. 2, Issue 7, Jul 2014, 157-162 Impact Journals DYNAMIC QUERY FORMS WITH

More information

Cassandra. References:

Cassandra. References: Cassandra References: Becker, Moritz; Sewell, Peter. Cassandra: Flexible Trust Management, Applied to Electronic Health Records. 2004. Li, Ninghui; Mitchell, John. Datalog with Constraints: A Foundation

More information

Using Adaptive Random Trees (ART) for optimal scorecard segmentation

Using Adaptive Random Trees (ART) for optimal scorecard segmentation A FAIR ISAAC WHITE PAPER Using Adaptive Random Trees (ART) for optimal scorecard segmentation By Chris Ralph Analytic Science Director April 2006 Summary Segmented systems of models are widely recognized

More information

Sanjeev Kumar. contribute

Sanjeev Kumar. contribute RESEARCH ISSUES IN DATAA MINING Sanjeev Kumar I.A.S.R.I., Library Avenue, Pusa, New Delhi-110012 [email protected] 1. Introduction The field of data mining and knowledgee discovery is emerging as a

More information

Beads Under the Cloud

Beads Under the Cloud Beads Under the Cloud Intermediate/Middle School Grades Problem Solving Mathematics Formative Assessment Lesson Designed and revised by Kentucky Department of Education Mathematics Specialists Field-tested

More information

Fall 2012 Q530. Programming for Cognitive Science

Fall 2012 Q530. Programming for Cognitive Science Fall 2012 Q530 Programming for Cognitive Science Aimed at little or no programming experience. Improve your confidence and skills at: Writing code. Reading code. Understand the abilities and limitations

More information

Machine Learning and Data Mining -

Machine Learning and Data Mining - Machine Learning and Data Mining - Perceptron Neural Networks Nuno Cavalheiro Marques ([email protected]) Spring Semester 2010/2011 MSc in Computer Science Multi Layer Perceptron Neurons and the Perceptron

More information

LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING. ----Changsheng Liu 10-30-2014

LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING. ----Changsheng Liu 10-30-2014 LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING ----Changsheng Liu 10-30-2014 Agenda Semi Supervised Learning Topics in Semi Supervised Learning Label Propagation Local and global consistency Graph

More information

Checking for Dimensional Correctness in Physics Equations

Checking for Dimensional Correctness in Physics Equations Checking for Dimensional Correctness in Physics Equations C.W. Liew Department of Computer Science Lafayette College [email protected] D.E. Smith Department of Computer Science Rutgers University [email protected]

More information

Big Data Text Mining and Visualization. Anton Heijs

Big Data Text Mining and Visualization. Anton Heijs Copyright 2007 by Treparel Information Solutions BV. This report nor any part of it may be copied, circulated, quoted without prior written approval from Treparel7 Treparel Information Solutions BV Delftechpark

More information

Let the data speak to you. Look Who s Peeking at Your Paycheck. Big Data. What is Big Data? The Artemis project: Saving preemies using Big Data

Let the data speak to you. Look Who s Peeking at Your Paycheck. Big Data. What is Big Data? The Artemis project: Saving preemies using Big Data CS535 Big Data W1.A.1 CS535 BIG DATA W1.A.2 Let the data speak to you Medication Adherence Score How likely people are to take their medication, based on: How long people have lived at the same address

More information

Parallel Data Mining. Team 2 Flash Coders Team Research Investigation Presentation 2. Foundations of Parallel Computing Oct 2014

Parallel Data Mining. Team 2 Flash Coders Team Research Investigation Presentation 2. Foundations of Parallel Computing Oct 2014 Parallel Data Mining Team 2 Flash Coders Team Research Investigation Presentation 2 Foundations of Parallel Computing Oct 2014 Agenda Overview of topic Analysis of research papers Software design Overview

More information

Learning is a very general term denoting the way in which agents:

Learning is a very general term denoting the way in which agents: What is learning? Learning is a very general term denoting the way in which agents: Acquire and organize knowledge (by building, modifying and organizing internal representations of some external reality);

More information

The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2

The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2 2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2 1 School of

More information

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING EFFICIENT DATA PRE-PROCESSING FOR DATA MINING USING NEURAL NETWORKS JothiKumar.R 1, Sivabalan.R.V 2 1 Research scholar, Noorul Islam University, Nagercoil, India Assistant Professor, Adhiparasakthi College

More information

Chapter 12. Introduction. Introduction. User Documentation and Online Help

Chapter 12. Introduction. Introduction. User Documentation and Online Help Chapter 12 User Documentation and Online Help Introduction When it comes to learning about computer systems many people experience anxiety, frustration, and disappointment Even though increasing attention

More information

Machine Learning Big Data using Map Reduce

Machine Learning Big Data using Map Reduce Machine Learning Big Data using Map Reduce By Michael Bowles, PhD Where Does Big Data Come From? -Web data (web logs, click histories) -e-commerce applications (purchase histories) -Retail purchase histories

More information

A Python Tour: Just a Brief Introduction CS 303e: Elements of Computers and Programming

A Python Tour: Just a Brief Introduction CS 303e: Elements of Computers and Programming A Python Tour: Just a Brief Introduction CS 303e: Elements of Computers and Programming "The only way to learn a new programming language is by writing programs in it." -- B. Kernighan and D. Ritchie "Computers

More information

Machine Learning. Mausam (based on slides by Tom Mitchell, Oren Etzioni and Pedro Domingos)

Machine Learning. Mausam (based on slides by Tom Mitchell, Oren Etzioni and Pedro Domingos) Machine Learning Mausam (based on slides by Tom Mitchell, Oren Etzioni and Pedro Domingos) What Is Machine Learning? A computer program is said to learn from experience E with respect to some class of

More information

Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research [email protected]

Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Introduction to Machine Learning Lecture 1 Mehryar Mohri Courant Institute and Google Research [email protected] Introduction Logistics Prerequisites: basics concepts needed in probability and statistics

More information

Keywords data mining, prediction techniques, decision making.

Keywords data mining, prediction techniques, decision making. Volume 5, Issue 4, April 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Analysis of Datamining

More information

Lecture 8 February 4

Lecture 8 February 4 ICS273A: Machine Learning Winter 2008 Lecture 8 February 4 Scribe: Carlos Agell (Student) Lecturer: Deva Ramanan 8.1 Neural Nets 8.1.1 Logistic Regression Recall the logistic function: g(x) = 1 1 + e θt

More information

Recovering Business Rules from Legacy Source Code for System Modernization

Recovering Business Rules from Legacy Source Code for System Modernization Recovering Business Rules from Legacy Source Code for System Modernization Erik Putrycz, Ph.D. Anatol W. Kark Software Engineering Group National Research Council, Canada Introduction Legacy software 000009*

More information

W. Heath Rushing Adsurgo LLC. Harness the Power of Text Analytics: Unstructured Data Analysis for Healthcare. Session H-1 JTCC: October 23, 2015

W. Heath Rushing Adsurgo LLC. Harness the Power of Text Analytics: Unstructured Data Analysis for Healthcare. Session H-1 JTCC: October 23, 2015 W. Heath Rushing Adsurgo LLC Harness the Power of Text Analytics: Unstructured Data Analysis for Healthcare Session H-1 JTCC: October 23, 2015 Outline Demonstration: Recent article on cnn.com Introduction

More information

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski [email protected]

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk Introduction to Machine Learning and Data Mining Prof. Dr. Igor Trakovski [email protected] Neural Networks 2 Neural Networks Analogy to biological neural systems, the most robust learning systems

More information

Machine Learning Final Project Spam Email Filtering

Machine Learning Final Project Spam Email Filtering Machine Learning Final Project Spam Email Filtering March 2013 Shahar Yifrah Guy Lev Table of Content 1. OVERVIEW... 3 2. DATASET... 3 2.1 SOURCE... 3 2.2 CREATION OF TRAINING AND TEST SETS... 4 2.3 FEATURE

More information

Instant SQL Programming

Instant SQL Programming Instant SQL Programming Joe Celko Wrox Press Ltd. INSTANT Table of Contents Introduction 1 What Can SQL Do for Me? 2 Who Should Use This Book? 2 How To Use This Book 3 What You Should Know 3 Conventions

More information

Search and Information Retrieval

Search and Information Retrieval Search and Information Retrieval Search on the Web 1 is a daily activity for many people throughout the world Search and communication are most popular uses of the computer Applications involving search

More information

How To Use Neural Networks In Data Mining

How To Use Neural Networks In Data Mining International Journal of Electronics and Computer Science Engineering 1449 Available Online at www.ijecse.org ISSN- 2277-1956 Neural Networks in Data Mining Priyanka Gaur Department of Information and

More information

SPECIFICATION BY EXAMPLE. Gojko Adzic. How successful teams deliver the right software. MANNING Shelter Island

SPECIFICATION BY EXAMPLE. Gojko Adzic. How successful teams deliver the right software. MANNING Shelter Island SPECIFICATION BY EXAMPLE How successful teams deliver the right software Gojko Adzic MANNING Shelter Island Brief Contents 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 Preface xiii Acknowledgments xxii

More information

MONITORING AND DIAGNOSIS OF A MULTI-STAGE MANUFACTURING PROCESS USING BAYESIAN NETWORKS

MONITORING AND DIAGNOSIS OF A MULTI-STAGE MANUFACTURING PROCESS USING BAYESIAN NETWORKS MONITORING AND DIAGNOSIS OF A MULTI-STAGE MANUFACTURING PROCESS USING BAYESIAN NETWORKS Eric Wolbrecht Bruce D Ambrosio Bob Paasch Oregon State University, Corvallis, OR Doug Kirby Hewlett Packard, Corvallis,

More information

This document contains Chapter 2: Statistics, Data Analysis, and Probability strand from the 2008 California High School Exit Examination (CAHSEE):

This document contains Chapter 2: Statistics, Data Analysis, and Probability strand from the 2008 California High School Exit Examination (CAHSEE): This document contains Chapter 2:, Data Analysis, and strand from the 28 California High School Exit Examination (CAHSEE): Mathematics Study Guide published by the California Department of Education. The

More information

A Hybrid Modeling Platform to meet Basel II Requirements in Banking Jeffery Morrision, SunTrust Bank, Inc.

A Hybrid Modeling Platform to meet Basel II Requirements in Banking Jeffery Morrision, SunTrust Bank, Inc. A Hybrid Modeling Platform to meet Basel II Requirements in Banking Jeffery Morrision, SunTrust Bank, Inc. Introduction: The Basel Capital Accord, ready for implementation in force around 2006, sets out

More information

Gerard Mc Nulty Systems Optimisation Ltd [email protected]/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I

Gerard Mc Nulty Systems Optimisation Ltd gmcnulty@iol.ie/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I Gerard Mc Nulty Systems Optimisation Ltd [email protected]/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I Data is Important because it: Helps in Corporate Aims Basis of Business Decisions Engineering Decisions Energy

More information

Microsoft Azure Machine learning Algorithms

Microsoft Azure Machine learning Algorithms Microsoft Azure Machine learning Algorithms Tomaž KAŠTRUN @tomaz_tsql [email protected] http://tomaztsql.wordpress.com Our Sponsors Speaker info https://tomaztsql.wordpress.com Agenda Focus on explanation

More information

Web Data Mining: A Case Study. Abstract. Introduction

Web Data Mining: A Case Study. Abstract. Introduction Web Data Mining: A Case Study Samia Jones Galveston College, Galveston, TX 77550 Omprakash K. Gupta Prairie View A&M, Prairie View, TX 77446 [email protected] Abstract With an enormous amount of data stored

More information

not possible or was possible at a high cost for collecting the data.

not possible or was possible at a high cost for collecting the data. Data Mining and Knowledge Discovery Generating knowledge from data Knowledge Discovery Data Mining White Paper Organizations collect a vast amount of data in the process of carrying out their day-to-day

More information

Outline. NP-completeness. When is a problem easy? When is a problem hard? Today. Euler Circuits

Outline. NP-completeness. When is a problem easy? When is a problem hard? Today. Euler Circuits Outline NP-completeness Examples of Easy vs. Hard problems Euler circuit vs. Hamiltonian circuit Shortest Path vs. Longest Path 2-pairs sum vs. general Subset Sum Reducing one problem to another Clique

More information

Open source business rules management system

Open source business rules management system JBoss Enterprise BRMS Open source business rules management system What is it? JBoss Enterprise BRMS is an open source business rules management system that enables easy business policy and rules development,

More information

New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction

New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction Introduction New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Predictive analytics encompasses the body of statistical knowledge supporting the analysis of massive data sets.

More information

Event driven trading new studies on innovative way. of trading in Forex market. Michał Osmoła INIME live 23 February 2016

Event driven trading new studies on innovative way. of trading in Forex market. Michał Osmoła INIME live 23 February 2016 Event driven trading new studies on innovative way of trading in Forex market Michał Osmoła INIME live 23 February 2016 Forex market From Wikipedia: The foreign exchange market (Forex, FX, or currency

More information

SEMANTIC WEB BASED INFERENCE MODEL FOR LARGE SCALE ONTOLOGIES FROM BIG DATA

SEMANTIC WEB BASED INFERENCE MODEL FOR LARGE SCALE ONTOLOGIES FROM BIG DATA SEMANTIC WEB BASED INFERENCE MODEL FOR LARGE SCALE ONTOLOGIES FROM BIG DATA J.RAVI RAJESH PG Scholar Rajalakshmi engineering college Thandalam, Chennai. [email protected] Mrs.

More information

CPSC 340: Machine Learning and Data Mining. Mark Schmidt University of British Columbia Fall 2015

CPSC 340: Machine Learning and Data Mining. Mark Schmidt University of British Columbia Fall 2015 CPSC 340: Machine Learning and Data Mining Mark Schmidt University of British Columbia Fall 2015 Outline 1) Intro to Machine Learning and Data Mining: Big data phenomenon and types of data. Definitions

More information

Data and Analysis. Informatics 1 School of Informatics, University of Edinburgh. Part III Unstructured Data. Ian Stark. Staff-Student Liaison Meeting

Data and Analysis. Informatics 1 School of Informatics, University of Edinburgh. Part III Unstructured Data. Ian Stark. Staff-Student Liaison Meeting Inf1-DA 2010 2011 III: 1 / 89 Informatics 1 School of Informatics, University of Edinburgh Data and Analysis Part III Unstructured Data Ian Stark February 2011 Inf1-DA 2010 2011 III: 2 / 89 Part III Unstructured

More information

Statistical Machine Translation

Statistical Machine Translation Statistical Machine Translation Some of the content of this lecture is taken from previous lectures and presentations given by Philipp Koehn and Andy Way. Dr. Jennifer Foster National Centre for Language

More information

Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science

Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.035, Spring 2013 Handout Scanner-Parser Project Thursday, Feb 7 DUE: Wednesday, Feb 20, 9:00 pm This project

More information

Pattern Insight Clone Detection

Pattern Insight Clone Detection Pattern Insight Clone Detection TM The fastest, most effective way to discover all similar code segments What is Clone Detection? Pattern Insight Clone Detection is a powerful pattern discovery technology

More information

Recurrent Neural Networks

Recurrent Neural Networks Recurrent Neural Networks Neural Computation : Lecture 12 John A. Bullinaria, 2015 1. Recurrent Neural Network Architectures 2. State Space Models and Dynamical Systems 3. Backpropagation Through Time

More information

Using Formulas, Functions, and Data Analysis Tools Excel 2010 Tutorial

Using Formulas, Functions, and Data Analysis Tools Excel 2010 Tutorial Using Formulas, Functions, and Data Analysis Tools Excel 2010 Tutorial Excel file for use with this tutorial Tutor1Data.xlsx File Location http://faculty.ung.edu/kmelton/data/tutor1data.xlsx Introduction:

More information

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. ~ Spring~r

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. ~ Spring~r Bing Liu Web Data Mining Exploring Hyperlinks, Contents, and Usage Data With 177 Figures ~ Spring~r Table of Contents 1. Introduction.. 1 1.1. What is the World Wide Web? 1 1.2. ABrief History of the Web

More information

Classification using Logistic Regression

Classification using Logistic Regression Classification using Logistic Regression Ingmar Schuster Patrick Jähnichen using slides by Andrew Ng Institut für Informatik This lecture covers Logistic regression hypothesis Decision Boundary Cost function

More information

Sample Fraction Addition and Subtraction Concepts Activities 1 3

Sample Fraction Addition and Subtraction Concepts Activities 1 3 Sample Fraction Addition and Subtraction Concepts Activities 1 3 College- and Career-Ready Standard Addressed: Build fractions from unit fractions by applying and extending previous understandings of operations

More information

Experiences with Online Programming Examinations

Experiences with Online Programming Examinations Experiences with Online Programming Examinations Monica Farrow and Peter King School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh EH14 4AS Abstract An online programming examination

More information

Lecture 1: Course overview, circuits, and formulas

Lecture 1: Course overview, circuits, and formulas Lecture 1: Course overview, circuits, and formulas Topics in Complexity Theory and Pseudorandomness (Spring 2013) Rutgers University Swastik Kopparty Scribes: John Kim, Ben Lund 1 Course Information Swastik

More information

Chapter 4: Artificial Neural Networks

Chapter 4: Artificial Neural Networks Chapter 4: Artificial Neural Networks CS 536: Machine Learning Littman (Wu, TA) Administration icml-03: instructional Conference on Machine Learning http://www.cs.rutgers.edu/~mlittman/courses/ml03/icml03/

More information

Audit Analytics. --An innovative course at Rutgers. Qi Liu. Roman Chinchila

Audit Analytics. --An innovative course at Rutgers. Qi Liu. Roman Chinchila Audit Analytics --An innovative course at Rutgers Qi Liu Roman Chinchila A new certificate in Analytic Auditing Tentative courses: Audit Analytics Special Topics in Audit Analytics Forensic Accounting

More information

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT

More information

Database Marketing, Business Intelligence and Knowledge Discovery

Database Marketing, Business Intelligence and Knowledge Discovery Database Marketing, Business Intelligence and Knowledge Discovery Note: Using material from Tan / Steinbach / Kumar (2005) Introduction to Data Mining,, Addison Wesley; and Cios / Pedrycz / Swiniarski

More information

Information Visualization WS 2013/14 11 Visual Analytics

Information Visualization WS 2013/14 11 Visual Analytics 1 11.1 Definitions and Motivation Lot of research and papers in this emerging field: Visual Analytics: Scope and Challenges of Keim et al. Illuminating the path of Thomas and Cook 2 11.1 Definitions and

More information

Big Data & Scripting Part II Streaming Algorithms

Big Data & Scripting Part II Streaming Algorithms Big Data & Scripting Part II Streaming Algorithms 1, 2, a note on sampling and filtering sampling: (randomly) choose a representative subset filtering: given some criterion (e.g. membership in a set),

More information

Analytics on Big Data

Analytics on Big Data Analytics on Big Data Riccardo Torlone Università Roma Tre Credits: Mohamed Eltabakh (WPI) Analytics The discovery and communication of meaningful patterns in data (Wikipedia) It relies on data analysis

More information

Linear Classification. Volker Tresp Summer 2015

Linear Classification. Volker Tresp Summer 2015 Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong

More information

Orthogonal Projections and Orthonormal Bases

Orthogonal Projections and Orthonormal Bases CS 3, HANDOUT -A, 3 November 04 (adjusted on 7 November 04) Orthogonal Projections and Orthonormal Bases (continuation of Handout 07 of 6 September 04) Definition (Orthogonality, length, unit vectors).

More information

Equity forecast: Predicting long term stock price movement using machine learning

Equity forecast: Predicting long term stock price movement using machine learning Equity forecast: Predicting long term stock price movement using machine learning Nikola Milosevic School of Computer Science, University of Manchester, UK [email protected] Abstract Long

More information

Simulated learners in peers assessment for introductory programming courses

Simulated learners in peers assessment for introductory programming courses Simulated learners in peers assessment for introductory programming courses Alexandre de Andrade Barbosa 1,3 and Evandro de Barros Costa 2,3 1 Federal University of Alagoas - Arapiraca Campus, Arapiraca

More information

A Concrete Introduction. to the Abstract Concepts. of Integers and Algebra using Algebra Tiles

A Concrete Introduction. to the Abstract Concepts. of Integers and Algebra using Algebra Tiles A Concrete Introduction to the Abstract Concepts of Integers and Algebra using Algebra Tiles Table of Contents Introduction... 1 page Integers 1: Introduction to Integers... 3 2: Working with Algebra Tiles...

More information

The Goldberg Rao Algorithm for the Maximum Flow Problem

The Goldberg Rao Algorithm for the Maximum Flow Problem The Goldberg Rao Algorithm for the Maximum Flow Problem COS 528 class notes October 18, 2006 Scribe: Dávid Papp Main idea: use of the blocking flow paradigm to achieve essentially O(min{m 2/3, n 1/2 }

More information

Chapter 6. The stacking ensemble approach

Chapter 6. The stacking ensemble approach 82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

More information

AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S.

AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S. AUTOMATION OF ENERGY DEMAND FORECASTING by Sanzad Siddique, B.S. A Thesis submitted to the Faculty of the Graduate School, Marquette University, in Partial Fulfillment of the Requirements for the Degree

More information