Machine learning

Introduction

All the techniques that we have seen until now allow us to build intelligent systems. The limitation of these systems is that they can only solve the problems they are programmed for. But we should only consider a system intelligent if it is also able to observe its environment and learn from it. Real intelligence resides in adaptation: being able to integrate new knowledge, to solve new problems, to learn from mistakes.

(LSI-FIB-UPC) Artificial Intelligence, Term 2009/2010

Goals

The goal is not to model human learning. The goal is to overcome the limitations of the usual AI applications (KBS, planning, NLP, problem solving, ...):
- Their limit is in the knowledge that they have
- Their capacities cannot reach outside those limits
- It is not possible to foresee all possible problems from the beginning
We are looking for programs that can adapt without being reprogrammed.

Does it work?

"Does Machine Learning Really Work?", Tom Mitchell, AI Magazine, 1997.

Where and what can machine learning be applied for?
- Tasks that are very difficult to program (face recognition, voice, ...)
- Adaptable applications (intelligent interfaces, spam filters, recommendation systems, ...)
- Data mining (intelligent data analysis)
Types of machine learning

- Inductive learning: models are built from the generalization of examples. We look for patterns that explain the common characteristics of the examples.
- Deductive learning: deduction is applied to obtain generalizations from a solved example and its explanation.
- Genetic learning: algorithms inspired by the theory of evolution are applied to find general descriptions of groups of examples.
- Connexionist learning: generalization is performed by the adaptation mechanisms of artificial neural networks.

Inductive learning

- It is the area with the largest number of methods
- Goal: to discover general rules or concepts from a limited set of examples (common patterns)
- It is based on the search for similar characteristics among examples
- All its methods are based on inductive reasoning

Inductive reasoning vs deductive reasoning

Inductive reasoning:
- It obtains general knowledge from specific information
- The knowledge obtained is new
- It is not truth-preserving (new information can invalidate the knowledge obtained)
- It has no well-founded theory

Deductive reasoning:
- It obtains general knowledge from general knowledge
- The knowledge is not new (it is implicit in the initial knowledge)
- New knowledge cannot invalidate the knowledge already obtained
- Its basis is mathematical logic
Inductive learning

- From a formal point of view its results are invalid
- We assume that a limited number of examples represents the characteristics of the concept that we want to learn
- Just one counterexample invalidates the results
- But most human learning is inductive!

Learning as search (I)

- The usual way to view inductive learning is as a search problem
- The goal is to discover a function/representation that summarizes the characteristics of a set of examples
- The search space is the set of all possible concepts that can be built
- There are different ways to perform the search

Learning as search (II)

- Search space: the language used to describe the concepts = the set of concepts that can be described by the language
- Search operators: heuristic operators that allow us to explore the space of concepts
- Heuristic function: a preference function that guides the search (bias)
Types of inductive learning: supervised

- Each example is labeled with the concept it belongs to
- Learning is performed by contrast among concepts
- A set of heuristics allows us to generate different hypotheses
- A preference criterion (bias) allows us to choose the hypothesis most suitable for the examples
- Result: the concept or concepts that best describe the examples

Types of inductive learning: unsupervised

- Examples are not labeled
- We want to discover a suitable way to cluster the objects
- Learning is based on the discovery of similarity/dissimilarity among examples
- A heuristic preference criterion guides the search
- Result: a partition of the examples and a characterization of the partitions

Decision trees

- We can learn a concept as the set of questions that allows us to distinguish it from the others
- Using a tree as representation formalism we can store and organize these questions
- Each node of the tree is a question about an attribute
- The search space is the set of all possible trees of questions
- This representation is equivalent to a DNF: with n binary attributes there are 2^n possible instances and therefore 2^(2^n) distinct concepts
To reduce the computational cost of searching this space we must choose a bias (what kind of concepts are preferred):
- Decision: the tree that gives the minimal description of the goal concept given a set of examples
- Reason: such a tree will be the best at predicting new instances (the probability that unnecessary conditions appear is reduced)
- Occam's razor: the hypothesis that introduces the fewest assumptions and postulates the fewest entities is to be preferred

Algorithms for decision trees

- One of the first algorithms for building decision trees is ID3 (Quinlan, 1986)
- It belongs to the family of algorithms for Top-Down Induction of Decision Trees (TDIDT)
- ID3 performs a search using a hill-climbing strategy in the space of decision trees
- At each level of the tree an attribute is chosen and the set of examples is split using the values of that attribute; this process is repeated recursively for each partition
- The selection of the attribute is performed using a heuristic function

Information Theory

Information theory studies, among other things, the coding of messages and the cost of their transmission. If we define a set of messages M = {m_1, m_2, ..., m_n}, each one with probability P(m_i), we can define the quantity of information I that the set M contains as:

    I(M) = - Σ_{i=1}^{n} P(m_i) · log(P(m_i))

This value can be interpreted as the information needed to discriminate the messages of M (the number of bits necessary to code the messages).
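The formula above can be checked with a few lines of code. This is a minimal sketch (the function name is ours, not part of the slides), using base-2 logarithms so that the result is measured in bits:

```python
import math

def quantity_of_information(probabilities):
    # I(M) = -sum_i P(m_i) * log2(P(m_i)); a term with P(m_i) = 0 contributes 0
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Two equiprobable messages need exactly one bit to discriminate them:
print(quantity_of_information([0.5, 0.5]))   # 1.0
```

Note that the quantity of information is maximal when all messages are equiprobable and drops to zero when one message is certain.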
Quantity of information as heuristic (I)

- We can use an analogy with message coding, assuming that the classes are the messages and the proportion of examples of each class is their probability
- A decision tree can be seen as the coding that allows us to discriminate among classes
- We are looking for the minimal code that discriminates among classes
- Each attribute is evaluated to decide whether it is part of the code
- An attribute is better than another if it allows us to discriminate better among classes

Quantity of information as heuristic (II)

- At each level of the tree we have to find the attribute that allows us to minimize the code (minimizes the size of the tree)
- That attribute is the one that leaves the least quantity of information to be covered by the remaining attributes
- The selection of an attribute should result in subsets of examples that are biased towards one class
- We need a measure of the quantity of information not covered by an attribute: the entropy (E)

Information Gain

Quantity of information (X = examples, C = classification):

    I(X, C) = - Σ_{c_i ∈ C} (|c_i| / |X|) · log(|c_i| / |X|)

Entropy (A = attribute, [A(x) = v_i] = examples with value v_i):

    E(X, A, C) = Σ_{v_i ∈ A} (|[A(x) = v_i]| / |X|) · I([A(x) = v_i], C)

Information gain:

    G(X, A, C) = I(X, C) - E(X, A, C)
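The three formulas translate directly into code. A sketch under our own encoding (X is a list of attribute-to-value dictionaries, C a parallel list of class labels; names are ours), using log base 2:

```python
import math
from collections import Counter

def info(C):
    # I(X, C): quantity of information of a list of class labels
    n = len(C)
    return -sum((k / n) * math.log2(k / n) for k in Counter(C).values())

def entropy(X, a, C):
    # E(X, A, C): information left to cover after splitting on attribute a
    n = len(X)
    return sum(
        (len(sub) / n) * info(sub)
        for v in set(x[a] for x in X)
        for sub in [[c for x, c in zip(X, C) if x[a] == v]]
    )

def gain(X, a, C):
    # G(X, A, C) = I(X, C) - E(X, A, C)
    return info(C) - entropy(X, a, C)

# An attribute that separates the classes perfectly has zero entropy,
# so its gain equals the full quantity of information I(X, C):
X = [{"A": "a"}, {"A": "a"}, {"A": "b"}]
C = ["+", "+", "-"]
print(round(gain(X, "A", C), 3))   # 0.918
```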
Information Gain

[Figure: the gain of an attribute A with values v1, v2, v3 is the information I(X, C) of the whole set minus the information E(X, A, C) remaining in the partitions: G(X, A, C) = I(X, C) - E(X, A, C)]

ID3 Algorithm

Algorithm: ID3 (X: Examples, C: Classification, A: Attributes)
    if all examples are from the same class then
        return a leaf with the class name
    else
        Compute the quantity of information of the examples (I)
        foreach attribute in A do
            Compute the entropy (E) and the information gain (G)
        Pick the attribute a that maximizes G
        Delete a from the list of attributes (A)
        Generate a root node for the attribute a
        foreach partition generated by the values v_i of the attribute a do
            Tree_i = ID3(X(a=v_i), C(a=v_i), A - {a})
            Generate a new branch with a=v_i and Tree_i
        return the root node for a

Example (1)

Consider the following set of examples:

Ex. | Eyes  | Hair   | Height | Class
  1 | Blue  | Blonde | Tall   | +
  2 | Blue  | Dark   | Medium | +
  3 | Brown | Dark   | Medium | -
  4 | Green | Dark   | Medium | -
  5 | Green | Dark   | Tall   | +
  6 | Brown | Dark   | Small  | -
  7 | Green | Blonde | Small  | -
  8 | Blue  | Dark   | Medium | +
Example (2)

I(X, C) = -1/2 log(1/2) - 1/2 log(1/2) = 1

E(X, eyes) = (blue) 3/8 · (-1 log(1) - 0 log(0))
           + (brown) 2/8 · (-1 log(1) - 0 log(0))
           + (green) 3/8 · (-1/3 log(1/3) - 2/3 log(2/3)) = 0.344

E(X, hair) = (blonde) 2/8 · (-1/2 log(1/2) - 1/2 log(1/2))
           + (dark) 6/8 · (-1/2 log(1/2) - 1/2 log(1/2)) = 1

E(X, height) = (tall) 2/8 · (-1 log(1) - 0 log(0))
             + (medium) 4/8 · (-1/2 log(1/2) - 1/2 log(1/2))
             + (small) 2/8 · (-0 log(0) - 1 log(1)) = 0.5

Example (3)

We can see that the attribute eyes is the one that maximizes the gain:

G(X, eyes) = 1 - 0.344 = 0.656
G(X, hair) = 1 - 1 = 0
G(X, height) = 1 - 0.5 = 0.5

Example (4)

This attribute generates the first level of the tree:

EYES
├─ BLUE:  1, 2, 8  (+)
├─ BROWN: 3, 6     (-)
└─ GREEN: 4, 7 (-), 5 (+)
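The gains above can be reproduced mechanically. A self-contained sketch (helper names and the data encoding are ours, not part of the slides):

```python
import math
from collections import Counter

def info(C):
    # I(X, C): quantity of information of a list of class labels
    n = len(C)
    return -sum((k / n) * math.log2(k / n) for k in Counter(C).values())

def gain(X, a, C):
    # G(X, A, C) = I(X, C) - E(X, A, C)
    n = len(X)
    E = sum(
        (len(sub) / n) * info(sub)
        for v in set(x[a] for x in X)
        for sub in [[c for x, c in zip(X, C) if x[a] == v]]
    )
    return info(C) - E

# The eight examples of the table (Ex. 1..8)
rows = [("Blue","Blonde","Tall"), ("Blue","Dark","Medium"), ("Brown","Dark","Medium"),
        ("Green","Dark","Medium"), ("Green","Dark","Tall"), ("Brown","Dark","Small"),
        ("Green","Blonde","Small"), ("Blue","Dark","Medium")]
X = [dict(zip(("eyes", "hair", "height"), r)) for r in rows]
C = ["+", "+", "-", "-", "+", "-", "-", "+"]

for a in ("eyes", "hair", "height"):
    print(a, round(gain(X, a, C), 3))
# eyes 0.656, hair 0.0, height 0.5
```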
Example (5)

Only the node corresponding to the value green contains a mix of classes, so we repeat the process with those examples:

Ex. | Hair   | Height | Class
  4 | Dark   | Medium | -
  5 | Dark   | Tall   | +
  7 | Blonde | Small  | -

Example (6)

I(X, C) = -1/3 log(1/3) - 2/3 log(2/3) = 0.918

E(X, hair) = (blonde) 1/3 · (-0 log(0) - 1 log(1))
           + (dark) 2/3 · (-1/2 log(1/2) - 1/2 log(1/2)) = 0.666

E(X, height) = (tall) 1/3 · (-1 log(1) - 0 log(0))
             + (medium) 1/3 · (-0 log(0) - 1 log(1))
             + (small) 1/3 · (-0 log(0) - 1 log(1)) = 0

Example (7)

Now the attribute with the maximum gain is height:

G(X, hair) = 0.918 - 0.666 = 0.252
G(X, height) = 0.918 - 0 = 0.918
Example (8)

The resulting tree is totally discriminant:

EYES
├─ BLUE:  1, 2, 8  (+)
├─ BROWN: 3, 6     (-)
└─ GREEN: HEIGHT
          ├─ TALL:   5  (+)
          ├─ MEDIUM: 4  (-)
          └─ SMALL:  7  (-)
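The whole worked example can be reproduced with a compact implementation of the ID3 pseudocode. A sketch under our own encoding (nested dictionaries for inner nodes, class labels for leaves; names are ours):

```python
import math
from collections import Counter

def info(C):
    # I(X, C): quantity of information of a list of class labels
    n = len(C)
    return -sum((k / n) * math.log2(k / n) for k in Counter(C).values())

def id3(X, C, attrs):
    # All examples in the same class: return a leaf with the class name
    if len(set(C)) == 1:
        return C[0]
    # Maximizing the gain G = I(X, C) - E(X, A, C) is the same as
    # minimizing the entropy E, since I(X, C) is fixed at this node
    def entropy(a):
        return sum(
            (len(sub) / len(X)) * info(sub)
            for v in set(x[a] for x in X)
            for sub in [[c for x, c in zip(X, C) if x[a] == v]]
        )
    a = min(attrs, key=entropy)
    rest = [b for b in attrs if b != a]
    return {a: {v: id3([x for x in X if x[a] == v],
                       [c for x, c in zip(X, C) if x[a] == v],
                       rest)
                for v in set(x[a] for x in X)}}

rows = [("Blue","Blonde","Tall"), ("Blue","Dark","Medium"), ("Brown","Dark","Medium"),
        ("Green","Dark","Medium"), ("Green","Dark","Tall"), ("Brown","Dark","Small"),
        ("Green","Blonde","Small"), ("Blue","Dark","Medium")]
X = [dict(zip(("eyes", "hair", "height"), r)) for r in rows]
C = ["+", "+", "-", "-", "+", "-", "-", "+"]

tree = id3(X, C, ["eyes", "hair", "height"])
print(tree)
# the root splits on eyes; only the Green branch recurses, splitting on height
```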