Data Mining Systems Development. Arno Siebes


1 Data Mining Systems Development Arno Siebes

2 A Variety of Systems

Data mining systems development depends on why you develop the system in the first place:

- Mining research systems, e.g., for:
  - algorithmic research
  - systems research
- Applied research, e.g., tools for bioinformatics
- Commercial use, e.g.:
  - horizontal systems, i.e., systems that support a wide variety of algorithms and (data mining) problems
  - vertical systems, i.e., systems that are completely targeted towards a specific (business!) problem

These different applications obviously share some requirements on the system, but they also have requirements of their own. Where they share requirements, the degrees and the priorities can be rather different. Let us list and contrast these requirements.

Arno Siebes Data Mining Systems Development (page 2 of 27)

3 Algorithms

The need for algorithms varies with the application:

- Many and easily extended
  - pure research: extensibility is crucial for algorithmic research
  - horizontal commercial systems: under the precondition of manageability
- Specially tailored
  - vertical commercial systems: often even so far as having parameters fixed in an automation phase
  - applied research: clearly, but extensibility is also important

Extensibility can be achieved in various ways:

- a monolithic system: this means lots of code redundancy, and code management becomes cumbersome
- a plug and play approach, using code from various sources and APIs (the Kepler approach)
- a class library approach: only suitable for well-trained programmers, but it allows for lots of code sharing (the MLC++ approach)
- a components based approach: decomposing algorithms into a representation language, search operators, search strategies and quality functions (the Keso approach)

4 Data Management: storage

The data has to reside somewhere. It could be:

- a file system
  - algorithmic research
- a DBMS within the system (note that the intended difference between a file system and a DBMS is the presence of search accelerators (indices), special operators, and support for complex data structures)
  - systems research: you want access to all aspects that govern scalability
  - applied research: the data doesn't easily fit into standard (relational) DBMSs
- a commercial DBMS
  - commercial systems: that is where the data resides!

Note that facilities for importing data into an internal system may speed up processing considerably, but that may not be what your client actually wants.

5 KDD Process Support

We all know that data analysis is more than just running algorithms on a data set; there is a lot of pre and post processing:

- Data Preparation
  - Data Selection
  - Data Cleaning
  - Data Transformations
- Experimental Set Up
  - Random subset selection
  - Cross validation
- Post Processing
  - Result inspection (e.g., visualization)
  - Result exportation
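The experimental set up step (random subset selection and cross validation) can be illustrated with a minimal sketch. It assumes rows are addressed by integer index; the function name `k_fold_splits` and its parameters are hypothetical and not part of any system discussed in this talk.

```python
import random


def k_fold_splits(n_rows, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_rows))
    # Random subset selection: shuffle once, then cut into k disjoint folds.
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical usage: 10 rows, 5 folds.
splits = list(k_fold_splits(n_rows=10, k=5))
```

Each row ends up in exactly one test fold, so every row is used for validation once and for training in the remaining rounds.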

6 Data Preparation

There are various ways in which one can prepare data:

- Unix Tools and Scripting Languages
  - algorithmic research: all the flexibility you need
  - systems research: well, that is if there are no interesting research questions involved
- SQL
  - applied research: if the users are well trained in such languages
  - horizontal systems: if the users are well trained in such languages
- Visual Tools in the Interface
  - commercial systems: the other possibilities are simply too difficult for most users
  - applied research and systems research: there are interesting research questions!

7 Experimental Set Up

Who needs an experimental set up? You may think everyone, but that isn't true.

Who doesn't?

- in vertical applications there may be a hidden experimental set up, but the user should not be bothered by it

Why?

- the users are most probably simply not smart enough to know what they are doing
- the users won't have the resources to do this

Does this mean you don't test? (Shudder!)

- not necessarily: the system may have been tuned beforehand, while the results are monitored by experts

8 Post Processing

Post processing needs vary:

- None
  - algorithmic research: the results are only used in the validation of the algorithm
- Result Inspection
  - systems research: interesting research questions, e.g., how to visualize sets of association rules
  - applied research and commercial systems: decisions will be made on the basis of these results, thus thorough understanding is necessary
- Result Exportation
  - horizontal systems: at least some XML export facility
  - vertical systems: seamless integration with other products

9 Developing Systems: my experiences and plans

In the remainder of this talk, I will discuss some of my experiences and plans:

- The Keso system
  - system development research
  - some algorithmic research
  - goal: a (modest) horizontal system
- Data Surveyor at Data Distilleries
  - parallel with the Keso development
  - original goal: a horizontal system
  - products: vertical systems for analytical CRM
- The future
  - applied research: bioinformatics
  - system development research to support this work

10 The Keso System

The Keso system is a client server system:

[Diagram: Client and Server, with the components user interface, mining engine, mining server (DBMS) and caching facilities]

We'll briefly discuss the mining engine and the mining server aspects of Keso.

11 The Mining Engine

One of the aspects of the KESO data mining kernel architecture is that each data mining algorithm consists of four components:

1. A model representation language.
2. A (local) quality function: which of these models fits the database best?
3. A search strategy: exhaustive or heuristics driven search for good/best models.
4. Search operators, which define how a search strategy goes through the space of all possible models.
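The four-component decomposition can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the KESO code: the representation (a conjunction of attribute = value conditions), the names `accuracy_quality`, `add_condition` and `hill_climb`, and the example data are all hypothetical.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List, Tuple

Row = Dict[str, str]
Condition = Tuple[str, str]        # (attribute, value)
Model = FrozenSet[Condition]       # 1. representation: a conjunction of conditions


def covers(model: Model, row: Row) -> bool:
    return all(row.get(attr) == val for attr, val in model)


def accuracy_quality(model: Model, data: List[Row], target: Condition) -> float:
    """2. Quality function: fraction of covered rows that satisfy the target."""
    covered = [r for r in data if covers(model, r)]
    if not covered:
        return 0.0
    attr, val = target
    return sum(r[attr] == val for r in covered) / len(covered)


def add_condition(model: Model, data: List[Row], target: Condition) -> Iterable[Model]:
    """4. Search operator: refine a model by adding one more condition."""
    used = {a for a, _ in model} | {target[0]}
    for row in data:
        for attr, val in row.items():
            if attr not in used:
                yield model | {(attr, val)}


def hill_climb(data: List[Row], target: Condition,
               quality: Callable, operator: Callable,
               max_steps: int = 3) -> Tuple[Model, float]:
    """3. Search strategy: greedy hill climbing, starting from the empty model."""
    best: Model = frozenset()
    best_q = quality(best, data, target)
    for _ in range(max_steps):
        candidates = set(operator(best, data, target))
        if not candidates:
            break
        q, cand = max(((quality(c, data, target), c) for c in candidates),
                      key=lambda pair: pair[0])
        if q <= best_q:            # stop when no refinement improves the quality
            break
        best, best_q = cand, q
    return best, best_q


# Hypothetical toy data set.
rows = [
    {"age": "young", "income": "high", "buys": "yes"},
    {"age": "young", "income": "low", "buys": "yes"},
    {"age": "old", "income": "high", "buys": "no"},
    {"age": "old", "income": "low", "buys": "no"},
]
best_model, best_quality = hill_climb(rows, ("buys", "yes"),
                                      accuracy_quality, add_condition)
```

Because the strategy only sees the quality function and the operator as parameters, a different strategy (for example beam search) could reuse both unchanged, which is the kind of component sharing the decomposition is meant to allow.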

12 The Implementation

In the architecture this is reflected as follows:

[Architecture diagram; box labels: Search Man., Search Space Maintainer, Qual. C., Model Gen, Data Base]

13 Components

- Search Manager: contains a number of search modules. Each implements a search strategy.
- Description Generator: contains a number of operator modules. Each module implements an operator on a specific description language.
- Quality Computer: implements a number of quality function modules and has a separate module to query the database for cross-tables.

14 Search Space Manager

All communication between the components is through records in the Search Space Manager. Some fields in these records are:

- an oid (the logical name of ϕ)
- the description (ϕ itself)
- the quality of ϕ
- the operator o with which ϕ is constructed
- the parent model, i.e., ψ, with o(ψ) = ϕ

While the search space is explored, these records are stored in a database.
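A record with the fields listed above could look roughly as follows. This is a sketch under assumptions: the class and method names are hypothetical, and an in-memory dictionary stands in for the database that holds the records.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class SearchRecord:
    oid: int                  # logical name of the model
    description: str          # the model itself
    quality: float            # quality of the model
    operator: Optional[str]   # operator that constructed it (None for the root)
    parent: Optional[int]     # oid of the parent model (None for the root)


class SearchSpace:
    """In-memory stand-in for the stored search space records."""

    def __init__(self) -> None:
        self._records: Dict[int, SearchRecord] = {}

    def add(self, description: str, quality: float,
            operator: Optional[str] = None, parent: Optional[int] = None) -> int:
        oid = len(self._records)
        self._records[oid] = SearchRecord(oid, description, quality, operator, parent)
        return oid

    def route(self, oid: int) -> List[str]:
        """Follow parent links back to the root and return the route taken."""
        path: List[str] = []
        current: Optional[int] = oid
        while current is not None:
            record = self._records[current]
            path.append(record.description)
            current = record.parent
        return path[::-1]


# Hypothetical usage: a short chain of refinements.
space = SearchSpace()
root = space.add("true", 0.5)
young = space.add("age=young", 0.8, operator="add_condition", parent=root)
young_high = space.add("age=young and income=high", 0.9,
                       operator="add_condition", parent=young)
```

The parent and operator fields are what make it possible to reconstruct how a model was reached during the search.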

15 Lessons Learned: The Search Space Manager

Why did we decide to store the explored part of a search space?

- it allows users to see the route that the (heuristic) search has taken and to explore other avenues from arbitrary points in that space
- it can be used as a caching mechanism: for points that you have already visited, you don't have to repeat the quality computation

Caching: it is a nice idea, but not that practical.

- The search space grows quickly.
- Maintaining a good index structure is expensive, and linear search becomes expensive quickly as well.
- Conclusion: it is better to speed up the quality computation (i.e., database access).

Exploring the Search Space: this is a good idea, but it requires good user interface support. Experience shows that it is not necessary (not to say overkill) to store all points; it is far better to store only those points that were considered interesting during the search and to do some recomputation if necessary.

16 Lessons Learned: The Components Approach

Why did we choose a components based approach?

- extensibility through component sharing
- localised database access (quality computation)

Extensibility:

- search strategies: clearly true
- operators and quality functions: these depend strongly on the modelling language. Flexibility within one language is good.

Localised database access: good results, especially because of uniform database access (through data cubes). This allows for all kinds of optimization.

17 The Mining Server

At CWI we have a reasonably powerful SGI Origin 2000 SMP machine: /300 MHz MIPS and 64GB main memory. It is nice to exploit such hardware for the mining server. The possibilities lie in a few aspects of the Keso system:

- KESO's use of the main memory DBMS Monet (a CWI/UvA research prototype)
- the fact that the search of many data mining algorithms has a zooming type of behaviour
- the fact that quality functions are based on relatively simple aggregates, such as count and sum
- the fact that these aggregate queries are submitted in batches of related queries

18 Monet

Monet is a main memory DBMS, i.e., it is assumed that the database hotspot will always fit in main memory. An important aspect for us is that the data is stored as columns (per attribute) rather than as rows (per tuple), as is customary. Each attribute is stored in a BAT (binary association table) as oid-value pairs.

Data mining algorithms tend to investigate mostly one, and at most a few, attributes at the same time. In the case of classification trees, one considers the qualities of the splits per attribute:

- this ensures that the hotspot fits
- it allows parallelization:
  - distribute the attributes over the processors
  - distribute each attribute over the processors
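The column-wise layout can be mimicked in plain Python to show why evaluating a split touches only a few columns. This is a toy illustration, not Monet's actual data structures or API; the table contents and the function `cross_table` are hypothetical.

```python
from collections import Counter

# Hypothetical toy table stored column-wise:
# one list of (oid, value) pairs per attribute.
bats = {
    "age":    [(0, "young"), (1, "young"), (2, "old"), (3, "old")],
    "income": [(0, "high"), (1, "low"), (2, "high"), (3, "low")],
    "buys":   [(0, "yes"), (1, "yes"), (2, "no"), (3, "yes")],
}


def cross_table(bats, attribute, target):
    """Count (attribute value, target value) pairs using only two columns."""
    target_of = dict(bats[target])  # oid -> target value
    return Counter((value, target_of[oid]) for oid, value in bats[attribute])


# Evaluating a split on "age" never reads the "income" column.
age_counts = cross_table(bats, "age", "buys")
```

Since each attribute's counts are computed independently, the per-attribute calls could be handed to different processors, which corresponds to the first parallelization option on the slide.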

19 Zooming

Consider again the construction of a classification tree. Each node n in the tree describes a subset s_n of the database: if n is a child of m, then s_n is a subset of s_m.

In other words, if we store the select set of the parent, we only have to scan that part of the database for the evaluation of the candidate children:

- this requires the extra storage of one column only
- the optimization in time is exponential in the depth of the tree

Similar observations hold for many other mining algorithms.
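The zooming idea can be sketched as follows: a child node is evaluated by scanning only the stored select set of its parent. The columns, their contents and the helper `select` are hypothetical and only serve the illustration.

```python
def select(column, oids, value):
    """Return the oids from `oids` whose entry in `column` equals `value`."""
    return [oid for oid in oids if column[oid] == value]


# Hypothetical columns, stored as oid -> value mappings.
age = {0: "young", 1: "young", 2: "old", 3: "old", 4: "young"}
income = {0: "high", 1: "low", 2: "high", 3: "low", 4: "high"}

root = list(age)                                   # select set of the root: all oids
young = select(age, root, "young")                 # scans all 5 oids
young_high = select(income, young, "high")         # scans only the 3 oids of the parent
```

The select set of the parent is a single list of oids, which matches the remark that zooming costs the storage of one extra column.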

20 The Aggregates

In a client-server architecture (such as KESO), it is beneficial if the aggregates on which the quality functions are based are computed in the server rather than in the client.

From a parallelization point of view, these aggregates have the nice property that they distribute:

- the count of the whole select set is the sum of the counts of a partition of the select set
- the sum is the sum of the sums
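The distributive property is easy to check on a small example. The sketch below splits a list of values into parts (standing in for the parts of a select set handled by different processors) and combines the partial results; the helper `partition` and the values are hypothetical.

```python
def partition(values, n_parts):
    """Split `values` into `n_parts` disjoint parts (round-robin)."""
    return [values[i::n_parts] for i in range(n_parts)]


values = [3, 1, 4, 1, 5, 9, 2, 6]
parts = partition(values, 3)

# Each part could be aggregated by a different processor.
partial_counts = [len(part) for part in parts]
partial_sums = [sum(part) for part in parts]

# Combining the partial results gives the aggregates of the whole set.
total_count = sum(partial_counts)
total_sum = sum(partial_sums)
```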

21 Multi-query optimization

Traditionally, a DBMS optimizes each query posed to it separately. It does not take other queries in the system into account.

While constructing a classification tree, we consider all possible extensions at the same time. This leads to a batch of strongly related aggregate queries to the database.

Optimizing the combined set of queries (which is possible, among other reasons, because of the distributive nature of the aggregates) leads to far better response times than optimizing each query in the set on its own. To do this, a special optimizer (called the MIL squeezer) has been implemented in Monet.

Note that normal query optimization is, of course, still very much part of the process.
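The benefit of treating a batch of related count queries together can be illustrated with a toy comparison: one scan per query versus a single shared scan. This is only a sketch of the general idea and says nothing about how the MIL squeezer works internally; the data, the query format and both function names are hypothetical.

```python
from collections import Counter

# Hypothetical toy table.
rows = [
    {"age": "young", "income": "high", "buys": "yes"},
    {"age": "young", "income": "low", "buys": "yes"},
    {"age": "old", "income": "high", "buys": "no"},
    {"age": "old", "income": "low", "buys": "yes"},
]

# Each query asks: how many rows have attribute == value and target == target_value?
queries = [
    ("age", "young", "yes"),
    ("age", "old", "yes"),
    ("income", "high", "no"),
    ("income", "low", "no"),
]


def answer_separately(rows, queries, target="buys"):
    """One full scan of the table per query."""
    return {
        (attr, value, tval): sum(
            1 for r in rows if r[attr] == value and r[target] == tval
        )
        for attr, value, tval in queries
    }


def answer_batch(rows, queries, target="buys"):
    """A single shared scan that updates the counters for all queries."""
    attributes = {attr for attr, _, _ in queries}
    counts = Counter()
    for r in rows:
        for attr in attributes:
            counts[(attr, r[attr], r[target])] += 1
    return {q: counts[q] for q in queries}


separate = answer_separately(rows, queries)
batched = answer_batch(rows, queries)
```

Both functions return the same counts; the batched version reads each row once regardless of the number of queries.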

22 Lessons learned

Using our own DBMS was a major asset in the work:

- all unnecessary components, such as locking and transaction management in general, could be left out
- operators such as the datacube could be implemented directly in the core of the DBMS
- all optimization tricks could be built in directly
- all of this results in far greater speed than commercial platforms such as Oracle can deliver

Moreover, it means that new data structures etc. are easily supported. In particular, joint work with Arno Knobbe (Syllogic) shows that the whole framework can be easily generalized for multi-relational mining.

If you want a drawback: stability.

23 Experiences at DD

As noted before, the development at Data Distilleries went in parallel with the development of the KESO system (in which they were part of the team, together with UH and GMD). Their technical experiences are in part what I've told you. But there are some interesting business experiences:

- User Interface: the importance of this aspect cannot be overestimated. The interface should support the whole process in an intuitive fashion. E.g., SQL is far too difficult for many users.
- Server:
  - exploiting an internal DBMS is ok, but you should be able to import data from a wide variety of sources
  - speed is not that important; other aspects dominate
- Extensibility: most clients are interested in solving business problems, i.e., they require vertical systems. It is not the range of algorithms but the suitability of the algorithm for that purpose that counts. This also means that exporting results has to integrate with other products.

24 The Future

I have learned a lot from the development of the Keso system, and I will start developing a new one at Utrecht University:

- again we will use Monet
- again we will think components based

But there is also going to be a difference: it is going to be an applied research vehicle for bioinformatics (in particular for genomics and proteomics). That is, it is going to be a biologist's workbench:

- including a DBMS for biological data
- supporting mining of biological data

25 Biomolecular Data

Biomolecular data is far from straightforward. Some main reasons are:

- During inheritance, DNA sequences are changed (both through cross-over and mutation).
- Only parts of DNA strings encode for genes, which may be turned on or off or even be silenced. The rest is called junk DNA...
- Large parts of the transcribed RNA are introns, which are subsequently removed.
- Only part of the mRNA string is translated into protein.
- Because of the redundancy of the genetic code, different mRNA substrings may yield the same protein.
- Subtly different protein molecules may act completely interchangeably, because large parts of the protein simply guarantee the required geometry of the more active parts of the molecule.

26 Searching

Because of this approximate nature of the strings, searching databases is not:

- the standard database equality search
- not even exact string matching

It is an alignment problem with penalty functions for the errors. There are a number of algorithms for this problem:

- using dynamic programming: find the best (partial) alignment between two strings
- using Hidden Markov Models: how likely is it that one string is produced by a machine that could have produced another string?

Moreover, such algorithms have been transferred to a database setting in programs such as BLAST and FASTA. However, these tricks are far from the regular database approach, i.e., indices. One of the goals will be to build index structures in Monet targeted towards alignment searches.
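The dynamic programming approach can be sketched with a global alignment score in the style of Needleman-Wunsch. Note the simplifications: this scores a global rather than a partial (local) alignment, and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration, not penalties taken from any tool mentioned here.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score between strings a and b.

    Uses the standard dynamic programming recurrence and keeps only
    two rows of the table in memory.
    """
    # Row 0: aligning the empty prefix of a with each prefix of b costs only gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]  # aligning a[:i] with the empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = cur[j - 1] + gap   # gap in a
            cur.append(max(diagonal, up, left))
        prev = cur
    return prev[-1]


# Hypothetical usage on short DNA-like strings.
same = alignment_score("GATTACA", "GATTACA")
one_gap = alignment_score("ACGT", "AGT")
```

The running time is proportional to the product of the two string lengths, which is one reason why a plain scan of a large sequence database with this method is expensive and why index support is attractive.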

27 Mining

Biomolecular data is a paradise for data miners. There is not enough theory to attack the (enormous amounts of) data in a traditional way. Examples of questions are:

- discovering genes (there are approaches using HMMs, but more traditional techniques such as classification trees are also used)
- building phylogenetic trees (i.e., tracing back evolution)
- function prediction for proteins
- uncovering metabolic pathways