DATA MINING - SELECTED TOPICS
Peter Brezany
Institute for Software Science, University of Vienna
E-mail: brezany@par.univie.ac.at
MINING SPATIAL DATABASES
Spatial Database Systems
Spatial database systems (SDBSs) offer spatial data types (e.g., points, lines, regions) in their data model and query language.
Important query types: region queries, nearest neighbour queries, and spatial joins.
Spatial data is data related to space.
The space of interest may be, for example, the earth's surface, a VLSI design, a 3D model of the human brain, or a 3D arrangement of chains of protein molecules.
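The three query types named above can be illustrated with a minimal Python sketch over 2D points. It is not taken from the slides: the function names, the sample data, and the distance-based join predicate are all assumptions chosen for illustration, and every query is answered by a plain linear scan.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def region_query(points: List[Point], rect: Rect) -> List[Point]:
    """Return all points that lie inside the query rectangle (borders included)."""
    xmin, ymin, xmax, ymax = rect
    return [p for p in points if xmin <= p[0] <= xmax and ymin <= p[1] <= ymax]


def nearest_neighbour(points: List[Point], query: Point) -> Point:
    """Return the point with the smallest Euclidean distance to the query point."""
    return min(points, key=lambda p: math.dist(p, query))


def spatial_join(left: List[Point], right: List[Point], max_dist: float):
    """Pair up objects from two sets that satisfy a spatial predicate.

    The predicate used here (distance <= max_dist) is an assumption;
    other predicates such as 'overlaps' or 'inside' are equally possible.
    """
    return [(a, b) for a in left for b in right if math.dist(a, b) <= max_dist]


# Hypothetical sample data.
towns = [(1.0, 1.0), (4.0, 5.0), (9.0, 2.0)]
lakes = [(1.5, 1.0), (8.0, 8.0)]

print(region_query(towns, (0.0, 0.0, 5.0, 5.0)))  # [(1.0, 1.0), (4.0, 5.0)]
print(nearest_neighbour(towns, (8.0, 3.0)))       # (9.0, 2.0)
print(spatial_join(towns, lakes, 1.0))            # [((1.0, 1.0), (1.5, 1.0))]
```

A linear scan is enough to show the semantics. A real SDBS would typically answer such queries through a spatial index (for example an R-tree) rather than by visiting every object.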
KDD in Spatial Databases
KDD in spatial databases is the extraction of implicit knowledge, spatial relations, or other patterns that are not explicitly stored in the database.
Algorithms for KDD in spatial DBs consider the relevant neighbourhood of the DB objects and the interaction of the objects with each other.
It is a promising field with fruitful research results and many challenging issues.
Spatial KDD System
[Figure: architecture of a spatial KDD system. Components: User, Controller, Spatial Database, SDBMS, Focus, Data Mining, Evaluation, Discovered Knowledge, Domain Knowledge, Knowledge Base]
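Spatial DB Operations for KDD
Standard operations/queries (region queries, ...)
Special operations/queries, based on spatial relations between objects A, B, C and D:
[Figure: topological relations (A disjoint B, A overlap B, B inside A); direction relations (D north A, B east A, C southeast A); distance relations (A dist=0 B, A dist=c B, A dist<c B)]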
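The relations on this slide can be made concrete with a small Python sketch. It rests on simplifying assumptions that do not come from the slides: objects are approximated by axis-aligned rectangles, the y axis points north, and direction is decided by comparing centre points. All names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    @property
    def centre(self):
        return ((self.xmin + self.xmax) / 2, (self.ymin + self.ymax) / 2)


def disjoint(a: Rect, b: Rect) -> bool:
    """True if the two rectangles share no point."""
    return a.xmax < b.xmin or b.xmax < a.xmin or a.ymax < b.ymin or b.ymax < a.ymin


def inside(b: Rect, a: Rect) -> bool:
    """True if b lies entirely within a."""
    return a.xmin <= b.xmin and b.xmax <= a.xmax and a.ymin <= b.ymin and b.ymax <= a.ymax


def overlap(a: Rect, b: Rect) -> bool:
    """True if the rectangles share points but neither contains the other."""
    return not disjoint(a, b) and not inside(a, b) and not inside(b, a)


def distance(a: Rect, b: Rect) -> float:
    """Smallest Euclidean distance between the rectangles (0 if they touch or intersect)."""
    dx = max(b.xmin - a.xmax, a.xmin - b.xmax, 0.0)
    dy = max(b.ymin - a.ymax, a.ymin - b.ymax, 0.0)
    return math.hypot(dx, dy)


def direction(b: Rect, a: Rect) -> str:
    """Compass direction of b relative to a, judged by centre points (an assumption)."""
    (ax, ay), (bx, by) = a.centre, b.centre
    ns = "north" if by > ay else "south" if by < ay else ""
    ew = "east" if bx > ax else "west" if bx < ax else ""
    return (ns + ew) or "same"


# Hypothetical objects.
a = Rect(0, 0, 4, 4)
inner = Rect(1, 1, 2, 2)         # inside a
partial = Rect(3, 3, 5, 5)       # overlaps a
north_obj = Rect(1, 6, 3, 8)     # north of a
east_obj = Rect(6, 1, 8, 3)      # east of a
southeast_obj = Rect(6, -4, 8, -2)

print(inside(inner, a), overlap(a, partial), disjoint(a, north_obj))
print(direction(north_obj, a), direction(east_obj, a), direction(southeast_obj, a))
print(distance(a, inner), distance(a, north_obj))
```

With these predicates, a relation such as "A dist<c B" becomes the test `distance(a, b) < c`. The centre-point rule for directions is deliberately crude; other definitions of directional relations are possible.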
Methods for Spatial KDD
Characterization (generalization) - finding a high-level concept description from detailed data.
Clustering - grouping objects by similarity.
Exploring spatial associations - discovering rules that associate one or more spatial objects with other spatial objects.
Classification - assigning objects to a given set of classes.
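As one illustration of the clustering method listed above, the following Python sketch groups points by spatial proximity. The technique is an assumption chosen for brevity and is not named in the slides: two points land in the same cluster whenever they are linked by a chain of points that are each at most `eps` apart (a simple single-linkage grouping using union-find). The function name and the sample data are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def cluster(points: List[Point], eps: float) -> List[List[Point]]:
    """Group points that are connected by hops of at most eps."""
    parent = list(range(len(points)))

    def find(i: int) -> int:
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the clusters of every pair of points that are close enough.
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= eps:
                parent[find(i)] = find(j)

    groups = {}
    for i, p in enumerate(points):
        groups.setdefault(find(i), []).append(p)
    return sorted(groups.values())


# Hypothetical sample data: two well-separated groups.
points = [(0, 0), (0, 1), (1, 1), (10, 10), (10, 11)]
print(cluster(points, eps=1.5))
# [[(0, 0), (0, 1), (1, 1)], [(10, 10), (10, 11)]]
```

Similarity here is plain Euclidean distance. The pairwise comparison is quadratic in the number of points, so it only illustrates the idea and would not scale to large spatial databases.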
Spatial Characterization
The existence of background knowledge in the form of concept hierarchies is needed.

High-level precipitation concepts:
  very dry         [0-0.1]
  dry              (0.1-0.3]
  moderately dry   (0.3-1.0]
  fair             (1.0-1.2]
  moderately wet   (1.2-2.0]
  wet              (2.0-5.0]
  very wet         5.0 & up

A year-season-month hierarchy:
  year
    spring: Mar., Apr., May
    summer: June, July, Aug.
    autumn: Sept., Oct., Nov.
    winter: Dec., Jan., Feb.
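The two concept hierarchies can be encoded as simple lookups. The Python sketch below uses the interval boundaries and month groupings from the slide. The function names are hypothetical, and the slide gives no measurement unit, so none is assumed.

```python
import bisect

# Upper bounds of the half-open intervals (lower, upper] from the slide.
UPPER_BOUNDS = [0.1, 0.3, 1.0, 1.2, 2.0, 5.0]
LABELS = ["very dry", "dry", "moderately dry", "fair",
          "moderately wet", "wet", "very wet"]

SEASON_OF_MONTH = {
    "Mar": "spring", "Apr": "spring", "May": "spring",
    "Jun": "summer", "Jul": "summer", "Aug": "summer",
    "Sep": "autumn", "Oct": "autumn", "Nov": "autumn",
    "Dec": "winter", "Jan": "winter", "Feb": "winter",
}


def precipitation_concept(value: float) -> str:
    """Generalize a numeric precipitation value to its high-level concept."""
    if value < 0:
        raise ValueError("precipitation cannot be negative")
    # bisect_left places a value equal to an upper bound in the lower
    # interval, which matches the closed right ends such as (0.1-0.3].
    return LABELS[bisect.bisect_left(UPPER_BOUNDS, value)]


def season(month: str) -> str:
    """Climb one level in the year-season-month hierarchy."""
    return SEASON_OF_MONTH[month]


print(precipitation_concept(0.2))  # dry
print(precipitation_concept(1.1))  # fair
print(precipitation_concept(7.5))  # very wet
print(season("Apr"))               # spring
```

Note that a value of exactly 5.0 maps to "wet" here, following the closed interval (2.0-5.0]; the slide's "5.0 & up" leaves that boundary ambiguous.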
Spatial Characterization (cont.)
Example query:
  extract region
  from precipitation-map
  where province = B.C. and period = spring and year = 1999
  in relevance to precipitation and region
[Figure: resulting map of regions labelled very dry, dry, moderately dry, fair, moderately wet (m.w.), wet, very wet]
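What such a characterization query might compute can be sketched in Python as follows. This is an illustration under stated assumptions, not the semantics of the query language on the slide: the record layout and sample values are hypothetical, and averaging the selected measurements per region before generalizing them is an assumed aggregation step.

```python
import bisect
from collections import defaultdict

# Precipitation concept hierarchy: upper bounds of the intervals and their labels.
UPPER_BOUNDS = [0.1, 0.3, 1.0, 1.2, 2.0, 5.0]
LABELS = ["very dry", "dry", "moderately dry", "fair",
          "moderately wet", "wet", "very wet"]
SPRING = {"Mar", "Apr", "May"}

# Hypothetical records: (region, province, year, month, precipitation).
records = [
    ("R1", "B.C.", 1999, "Mar", 0.05),
    ("R1", "B.C.", 1999, "Apr", 0.25),
    ("R2", "B.C.", 1999, "May", 2.5),
    ("R3", "B.C.", 1999, "Jul", 6.0),     # filtered out: not spring
    ("R4", "Alberta", 1999, "Apr", 0.2),  # filtered out: other province
    ("R2", "B.C.", 1998, "Apr", 0.4),     # filtered out: other year
]


def characterize(records, province, year, months):
    """Select the relevant records, then generalize each region's mean precipitation."""
    per_region = defaultdict(list)
    for region, prov, yr, month, precip in records:
        if prov == province and yr == year and month in months:
            per_region[region].append(precip)

    regions_by_concept = defaultdict(list)
    for region, values in per_region.items():
        mean = sum(values) / len(values)
        concept = LABELS[bisect.bisect_left(UPPER_BOUNDS, mean)]
        regions_by_concept[concept].append(region)
    return dict(regions_by_concept)


print(characterize(records, "B.C.", 1999, SPRING))
# {'dry': ['R1'], 'wet': ['R2']}
```

The result groups regions under high-level concepts, which corresponds to the labelled regions on the result map. A fuller implementation would also merge neighbouring regions that fall under the same concept.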
Towards Parallel Spatial KDD
Many real spatial DBs are becoming huge and increasingly complex, so KDD needs more computing resources.
CERN (HEP): 5 petabytes/year, 1750 scientists, 150 institutes, 32 countries
Medical imaging: one digitized slide = 7 GB, 1000 slides/day
Parallel processing is needed.
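The figures above translate into daily and yearly data volumes with a few lines of arithmetic. The Python sketch below assumes decimal units (1 GB = 10^9 bytes) and data production on 365 days per year; neither assumption is stated on the slide.

```python
# Decimal storage units (an assumption; the slide does not say which convention it uses).
GB = 10**9
TB = 10**12
PB = 10**15
DAYS_PER_YEAR = 365  # assumes data is produced every day of the year

# Medical imaging: 7 GB per digitized slide, 1000 slides per day.
imaging_daily = 7 * GB * 1000
imaging_yearly = imaging_daily * DAYS_PER_YEAR

# CERN (HEP): 5 petabytes per year.
cern_yearly = 5 * PB
cern_daily = cern_yearly / DAYS_PER_YEAR

print(f"Medical imaging: {imaging_daily / TB:.1f} TB/day, {imaging_yearly / PB:.2f} PB/year")
print(f"CERN: {cern_daily / TB:.1f} TB/day")
```

Under these assumptions the imaging workload alone produces 7 TB per day, roughly 2.6 PB per year, which is the same order of magnitude as the CERN figure.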
COMMERCIAL DATA MINING SYSTEMS
Examples of Commercial Data Mining Systems
Many DM systems specialize in one data mining function, such as classification, or in just one approach to a data mining function, such as decision tree classification.
Other systems provide a broad spectrum of data mining functions.
Below we introduce a few systems that provide multiple data mining functions and explore multiple knowledge discovery techniques.
Prices: usually above 1 million ATS
Examples of Commercial Data Mining Systems (2)
Intelligent Miner (IBM): association, classification, regression, predictive modeling, deviation detection, sequential pattern analysis, and clustering; an application toolkit containing neural network algorithms, statistical methods, data preparation tools, and data visualization tools.
Enterprise Miner (SAS Institute): association, classification, regression, predictive modeling, deviation detection, and clustering; a variety of powerful statistical analysis tools that build on SAS's long history in the statistical analysis market.
MineSet (Silicon Graphics (SGI)): a distinguishing feature is its set of robust graphics tools, which use the powerful graphics features of SGI computers.
Examples of Commercial Data Mining Systems (3)
Clementine (Integral Solutions (ISL)): a distinguishing feature of Clementine is its object-oriented, extended module interface, which allows users' algorithms and utilities to be added to Clementine's visual programming environment.
DBMiner (DBMiner Technology): multiple data mining algorithms plus OLAP analysis.
There are many other commercial data mining products, systems, and research prototypes, which are also evolving fast.