QUT Digital Repository: http://eprints.qut.edu.au/

Nayak, Richi (2008) XML Data Mining: Process and Applications, in Song, Min and Wu, Yi-Fang, Eds. The Process and Applications of XML Data Mining. Idea Group Inc. / IGI Global.

Copyright 2008 Idea Group Inc. / IGI Global. This chapter appears in "The Process and Applications of XML Data Mining", edited by Min Song and Yi-Fang Wu. Copyright 2008, IGI Global. Posted by permission of the publisher.

THE PROCESS AND APPLICATIONS OF XML DATA MINING

Richi Nayak
Faculty of Information Technology, Queensland University of Technology
S 835, Gardens Point, GPO Box 2434, Brisbane, QLD 4001, Australia
r.nayak@qut.edu.au

Keywords: Data mining, XML, XML structure mining, XML content mining

ABSTRACT

XML has gained popularity for information representation, exchange and retrieval. As XML material becomes more abundant, its heterogeneity and structural irregularity limit the knowledge that can be gained from it. Data mining techniques therefore become essential for improving XML document handling. This chapter presents the capabilities and benefits of data mining techniques in the XML domain, as well as a conceptualization of the XML mining process. It also discusses the techniques that can be applied to XML document structure and/or content for knowledge discovery.

INTRODUCTION

The Web is an immense and dynamic collection of pages and services connected by countless hyperlinks; it therefore provides a rich and diversified source for data mining. Currently, the majority of this information is in Hyper Text Markup Language (HTML). HTML tags are primarily formatting markup and were designed to convey technical reports. Some internal structural information can be inferred from them (e.g. <h1> indicating important information), but they hold no semantic information regarding content. With an increasingly distributed corporate world and the progression to Web 2.0, HTML is considered an inferior means of data exchange.

To overcome these limitations, XML (eXtensible Markup Language) uses custom-defined tags to describe the data and the structural relationships of data within a document. XML is a subset of SGML (ISO 8879) and is defined by the World Wide Web Consortium (W3C) (Yergeau et al., 2004). XML tags describe the structural and semantic meaning of information in text documents, which makes XML documents semi-structured and self-describing. XML is rapidly becoming the standard for exchanging and representing data.
Many information sources have already structured, or are beginning to structure, their external view as a repository of XML documents, regardless of their internal storage mechanism. As XML data becomes more abundant, the ability to gain knowledge from XML sources decreases due to their heterogeneity and structural irregularity. Several advanced data processing techniques are required to retrieve and analyse such large amounts of semi-structured data. Automatic storage of XML documents in the form of relational or object-oriented data has been actively studied by database researchers (Abiteboul et al., 2000; Lee et al., 2002). Other researchers have successfully stored XML documents in native XML databases (Pardede, 2006). Consequently, several query languages for various XML data sources have been developed (Boag et al.). The use of these query languages is limited: for example, users need to know what kind of information is to be accessed, and only limited inputs and outputs are acceptable. Additionally, indexing based on structural similarity and/or on groupings of XML documents that share frequent sub-structures is needed to support effective document storage and retrieval (Nayak et al., 2002).

Data mining techniques such as clustering (Jain et al., 1999) can improve XML document storage and retrieval by grouping XML documents according to their structural similarity. Computation of structural similarity is also of great value in managing Web data. Many techniques for extracting and integrating relevant information from Web data sources require grouping the sources according to their structural similarity (Flesca et al., 2005). Moreover, data mining techniques (Fayyad, 1995) allow the user to search for unknown facts, that is, information hidden behind the data, and to pose more complex queries. For example, after identifying similarities among various XML documents using clustering, links between tags within a group of XML documents can be analysed using association mining. This may prove useful in the analysis of e-commerce web documents and subsequently in the personalisation of web pages.

There is a considerable body of research on mining useful information from numerical, symbolic and text data (Han & Kamber, 2001). There has been some progress on using XML as a language in data mining process models, such as (1) the Predictive Model Markup Language (PMML) (Wettschereck, 2001), which uses XML to specify several kinds of data mining models; (2) the XML-based Data Mining Specification Language (XDMSL), which describes the data mining process (Kotásek & Zendulka, 2002); and (3) the Log Markup Language, which uses XML to express the contents of Web server log files structurally (Punin et al., 2001). Research on developing data mining techniques for XML documents is gaining momentum (Nayak & Zaki, 2006).

The characteristic of XML that adds semantic and structural aspects to document contents offers new data mining opportunities. At the same time, it makes the data mining process more challenging, because the semantic and structural aspects must be included in the analysis. Given the irony that humans produce far more data than they can ever analyse, the development of XML mining techniques must keep pace with the development and implementation of XML technology itself. This chapter is motivated by the potential of these two mutually beneficial technologies. It first briefly describes XML data and the equivalent tree representation.
It then presents a classification of XML mining methods and a discussion of mining applications such as classification, clustering and association, followed by a summary of tools and techniques that can be successfully applied to the content or structure of XML documents for knowledge discovery. This chapter provides an up-to-date survey of XML mining and includes both academic efforts and commercial offerings.

REPRESENTATION OF XML DOCUMENTS

This section provides background information on XML. Let T be the set of all textual Web objects, and let X be the web pages containing XML, called XML data, such that X ⊆ T. There are two types of XML data: XML documents and XML schemas. An XML schema provides the data definitions and structure of an XML document (Abiteboul et al., 2000). XML documents are instances of a schema, a snapshot of what the document may contain. A schema specifies the allowable elements and attributes, the number of occurrences of elements and other constraints. A schema may be included internally or externally (within the same file as the document or in a different file, respectively).

In a heterogeneous and flexible environment such as the Web, it cannot be assumed that each XML document has a schema defining its structure. Even if such a schema exists, it may have undergone multiple modifications. Consequently, not all XML or Web data can automatically be classed as XML documents. Strictly, web data or XML data are classed as XML documents only if they are well-formed. To be well-formed, a page's XML must have properly nested tags, unique attributes (per element), one or more elements and exactly one root element, and must satisfy a number of schema-related constraints. Well-formed documents may have a schema but need not conform to it. Valid XML documents are a subset of well-formed XML documents: a valid XML document must additionally conform to (at least) one explicitly associated schema. Figure 1 depicts the various types of XML data and how they are related.
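The well-formedness conditions listed above can be checked mechanically. The following is a minimal sketch in Python, assuming the standard-library `xml.etree.ElementTree` parser; the helper name `is_well_formed` and the sample strings are illustrative and not part of the chapter. Note that this parser is non-validating: it tests well-formedness only and says nothing about validity against a schema.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Illustrative samples (assumptions, loosely based on the BookStore example).
good = "<BookStore><Book><Title>Introduction of XML</Title></Book></BookStore>"
bad_nesting = "<BookStore><Book><Title>Introduction of XML</Book></Title></BookStore>"
two_roots = "<Book/><Book/>"
duplicate_attr = '<Book id="1" id="2"/>'

for sample in (good, bad_nesting, two_roots, duplicate_attr):
    print(is_well_formed(sample), sample)
```

Only the first sample passes; the other three violate proper nesting, the single-root rule and attribute uniqueness respectively.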

T: textual web data, X: XML data, : ill-formed XML data, W: well-formed XML documents, V: valid XML documents

Figure 1: Relationship between various XML data

XML document:

    <?xml version="1.0" encoding="UTF-8"?>
    <BookStore>
      <Book>
        <Title> Introduction of XML </Title>
        <Author>
          <fname> Smith </fname>
          <lname> Andrew </lname>
        </Author>
        <ISBN> </ISBN>
        <Publisher> McGraw-Hills </Publisher>
      </Book>
      ..
    </BookStore>

DTD:

    <!DOCTYPE BookStore [
      <!ELEMENT BookStore (Book+)>
      <!ELEMENT Book (Title, (Author)*, ISBN, Publisher)>
      <!ELEMENT Title (#PCDATA)>
      <!ELEMENT Author (fName, mName?, lName)>
      <!ELEMENT ISBN (#PCDATA)>
      <!ELEMENT Publisher (#PCDATA)>
    ]>

Figure 2: Example of an XML document and its respective DTD

    <?xml version="1.0" encoding="utf-8"?>
    <xsd:schema xmlns:xsd= targetNamespace= elementFormDefault="qualified">
      <xsd:element name="bookstore">
        <xsd:complexType>
          <xsd:sequence>
            <xsd:element ref="book" minOccurs="1" maxOccurs="unbounded"/>
          </xsd:sequence>
        </xsd:complexType>
      </xsd:element>
      <xsd:element name="book">
        <xsd:complexType>
          <xsd:sequence>
            <xsd:element ref="title" minOccurs="1" maxOccurs="1"/>
            <xsd:element ref="author" minOccurs="1" maxOccurs="unbounded"/>
            <xsd:element ref="isbn" minOccurs="1" maxOccurs="1"/>
            <xsd:element ref="publisher" minOccurs="1" maxOccurs="1"/>
          </xsd:sequence>
        </xsd:complexType>
      </xsd:element>
      <xsd:element name="title" type="xsd:string"/>
      <xsd:element name="author" type="xsd:string"/>
      <xsd:element name="date" type="xsd:string"/>
      <xsd:element name="isbn" type="xsd:string"/>
      <xsd:element name="publisher" type="xsd:string"/>
    </xsd:schema>

Figure 3: Example of the respective XSD for the above document
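The tree representation that the chapter relies on can be made concrete with the BookStore document of Figure 2. The sketch below is one possible reading, assuming Python's standard `xml.etree.ElementTree`; the function name `leaf_paths` is hypothetical, and the document is abbreviated to a single Book. It lists every root-to-leaf path of element names, which is the form in which structure is later fed to mining algorithms.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the Figure 2 document (one Book only).
DOC = """<BookStore>
  <Book>
    <Title>Introduction of XML</Title>
    <Author><fname>Smith</fname><lname>Andrew</lname></Author>
    <ISBN></ISBN>
    <Publisher>McGraw-Hills</Publisher>
  </Book>
</BookStore>"""

def leaf_paths(element, prefix=()):
    """Yield one tuple of tag names per root-to-leaf path, in document order."""
    path = prefix + (element.tag,)
    children = list(element)
    if not children:
        yield path
    for child in children:
        yield from leaf_paths(child, path)

paths = list(leaf_paths(ET.fromstring(DOC)))
for p in paths:
    print("/".join(p))
```

For this document the sketch prints five paths, from `BookStore/Book/Title` to `BookStore/Book/Publisher`, with the nested `Author` element contributing two of them.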

There are several XML schema languages that allow the structure of XML documents to be described and their contents to be constrained (footnote 1). Only two are commonly used, namely DTD (Document Type Definition) and XML Schema, or XML Schema Definition (XSD). The DTD language is considered limited because it supports only a small set of data types, has loose structure constraints and limits content to text. To overcome these limitations, XSD provides features such as simple and complex types, rich datatype sets, occurrence constraints and inheritance. An XML schema usually comprises a set of schema components, such as type definitions and element declarations, which can be used to assess the validity of well-formed elements. It is believed that XSD, with its flexibility, will soon be more popular than DTD (footnote 2). Throughout this chapter, the term schema is used for both XML-DTD and XML-Schema unless otherwise specified. The term XML data is used for both XML documents and XML schemas. Figure 2 illustrates an XML document and its corresponding DTD. Figure 3 shows its respective XML Schema.

XML MINING: TAXONOMY

For several years data mining (DM) has been used to extract meaningful knowledge from large amounts of data. Mining of XML documents differs significantly from that of numerical, symbolic and text data. XML mining is the use of DM techniques to automatically discover and extract information from sources of XML documents. The hierarchical format in which data is represented in XML documents poses a challenge for DM. Moreover, XML documents can be designed with great flexibility and minimal restrictions. Many see this as one of the greatest strengths of XML; however, it makes document handling difficult.

Consider parts of two documents: <craft>boat building</craft> and <craft> boat </craft>. The intended interpretation of the former is occupation, and of the latter vessel.
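A small sketch shows how a purely word-based comparison treats these two fragments. It assumes Python's standard library, takes both tag names and content words as the vocabulary, and uses the Jaccard coefficient as the similarity measure; the chapter does not prescribe a particular measure, so that choice and the helper names are assumptions.

```python
import re
import xml.etree.ElementTree as ET

def word_set(fragment: str) -> set:
    """Collect tag names and lower-cased content words of an XML fragment."""
    words = set()
    for el in ET.fromstring(fragment).iter():
        words.add(el.tag.lower())
        words.update(re.findall(r"[a-z]+", (el.text or "").lower()))
    return words

def jaccard(a: set, b: set) -> float:
    """Shared words divided by all distinct words (1.0 if both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

occupation = word_set("<craft>boat building</craft>")
vessel = word_set("<craft> boat </craft>")
similarity = jaccard(occupation, vessel)
print(sorted(occupation), sorted(vessel), round(similarity, 2))
```

Two of the three distinct words are shared, so the fragments score about 0.67 even though their intended meanings differ.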
Similarity of the content does not distinguish the semantic intention of the tags. These two fragments will be found to be very similar based on the words common to the two sets {craft, boat, building, craft} and {craft, boat, craft}. Structure mining in this case provides the probability of a tag having a particular meaning. For example, a mining rule inferred from a collection of XML documents might be: "80% of the time, if an XML document contains a <craft> tag then it also contains a <driver> tag". Such a rule helps determine the appropriate interpretation of such homographic tags. Hence, mining the structure and content of documents can reveal when two apparently similar documents are actually completely different because of homographic tags.

The utilisation of XML data mining techniques offers many benefits and applications, such as:

- Enhancing information sharing among various industries and government by proposing techniques for organizing and integrating heterogeneous and distributed XML documents.
- Improving the accuracy and speed of XML-based search engines in retrieving the relevant portions of data (1) by suggesting XML documents according to the similarity of their structure and content, and (2) by discovering the links between XML tags that occur together within XML documents. For example, a DM rule can discover from a collection of XML documents that <telephone> tags must appear within <customer> tags. This information can be used to search only <customer> tags when executing a query for <telephone> details, thus making information retrieval efficient.

Footnotes:
1 XML Schema:
2 Introduction to XML Schema by Refsnes Data:

- Improving XML document handling and achieving efficient searches on relevant documents by using the developed set of predefined document classifications.
- Better representation of the information provided in Web sites, with better restructuring, by recommending (1) Web links that occur together and (2) Web documents that are similar in structure and content.

Mining of XML documents differs significantly from mining other structured data. XML mining includes mining of structures as well as content from XML documents (Nayak et al., 2002), as depicted in Figure 4. Mining XML content is generally carried out in the context of a known XML structure, possibly determined by XML structure mining. Content mining may, however, also play a role in clarifying XML structure. Therefore, to avoid information loss, structural and contextual data are frequently combined to make the best use of XML documents.

    XML Mining
        XML Structure Mining
            Intra-Structure Mining
            Inter-Structure Mining
        XML Content Mining
            Content Analysis
            Structure Clarification

Figure 4: A taxonomy of XML Mining

Next, both structure and content mining are discussed in terms of the data mining operations applied, such as classification, clustering and association, and the type of XML material available as input to these procedures. Technical details of measurements, such as criteria for classification or similarity metrics for clustering, are not covered in this section, since the main objective is to establish the usage and benefit of data mining in XML.

XML Structure Mining

XML is semi-structured data, so mining XML structure provides insights into how documents are organised. Element tags and their nesting dictate the structure of an XML document (Yergeau et al., 2004). For example, the textual structure enclosed by <author> </author> is used to describe the author tuple and its corresponding text in the document. Tags in XML are user-defined and describe the area of interest.
For example, <manufacturer>, <model> and <colour> tags can be used to describe car information for the automobile industry. Since XML provides a mechanism for tagging data with names (to describe the data), data mining can facilitate the retrieval of more accurate information about the structure of XML documents. XML structure mining is essentially mining for schema, and includes intra-structure mining and inter-structure mining.

Intra-structure Mining

Intra-structure mining is concerned with the structure within XML documents. Knowledge is discovered about the internal structure of XML documents.

The classification task of data mining can be applied to map a new XML document to a predefined class of documents. XML document structure can be read directly or via the document's schema. A document schema provides a definitive description of a document, while a document instance only shows the content of the document. Because the document definition outlined in a schema holds true for all document instances of that schema, the result of classifying schemas also holds true for all document instances of the classified schemas and can be reused for any other instances of these schemas. A schema is interpreted as a description of a class of XML documents. Let us assume that each document is accompanied by a schema; in the absence of a schema, the XML document is parsed and its structure is extracted and modelled. Given a collection of schemas as a training set, the objective of this task is to classify new XML schemas according to this training set. Both semantic and structural similarities are considered in classifying a schema into a class.

This task is most easily performed on valid XML documents. With a schema already defined for the new XML document, classification can proceed by comparing the classification schemas with the new schema. Any XML document with an associated schema should first be validated: it is important to distinguish between valid XML and well-formed XML with an incorrectly associated schema. For well-formed XML, an attempt is made to parse the document according to the classification schema. A successfully parsed document is classified as an instance of the relevant schema. Ill-formed XML with associated schemas may also be classified if enough of the document is parsed before an error occurs; the classification could then be used to rescue any potentially valuable information. The task is most difficult (but still possible) for XML with no associated schema. In this case, similarity is sought between the classification schemas (classes) and the document structure.

The clustering task of data mining can be used to identify similarities among XML documents. The structure of each XML document is inferred and modelled as a labelled tree.
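One way to picture such a labelled tree is sketched below, assuming Python's standard `xml.etree.ElementTree`. Each node records its element name, its 1-based position among its siblings and a cardinality. Here cardinality is taken to mean the number of siblings that share the node's tag, which is an interpretation made for illustration; the chapter does not define the term precisely, and the function name `build` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def build(element, position=1, cardinality=1):
    """Turn an element into a nested dict carrying name, position and cardinality."""
    kids = list(element)
    counts = Counter(kid.tag for kid in kids)  # how often each tag occurs among the children
    return {
        "name": element.tag,
        "position": position,        # 1-based index among all siblings
        "cardinality": cardinality,  # siblings (including this node) with the same tag
        "children": [build(kid, i, counts[kid.tag])
                     for i, kid in enumerate(kids, start=1)],
    }

tree = build(ET.fromstring("<Book><Title/><Author/><Author/><ISBN/></Book>"))
summary = [(c["name"], c["position"], c["cardinality"]) for c in tree["children"]]
print(summary)
```

The two `Author` children each receive a cardinality of 2, while `Title` and `ISBN` receive 1, so repeated elements remain visible in the structural model.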
Each node in the tree holds information about its element, e.g. name, cardinality and position. A clustering algorithm takes a collection of trees and groups them on the basis of semantic and structural similarity. These similarities are then used to generate a new schema. As a generalisation, the new schema becomes a superclass of the training set of schemas. The generated set of clustered schemas can then be used in classifying new schemas. The superclass schema can also be used in integrating heterogeneous XML documents for each application domain. This allows users to find, collect, filter and manage information sources on the Internet more effectively.

The association rule discovery task of data mining can describe relationships between tags that occur together in XML documents. An XML document or schema can be represented as a tree structure, and each tree branch (or path) is considered a transaction. By transforming the tree structure of XML into pseudo-transactions, it becomes possible to generate rules of the form "if an XML document contains a <craft> tag then 80% of the time it will also contain a <driver> tag". Such a rule is then applied in determining the appropriate interpretation of homographic tags (words that are alike in form but have distinctly different meanings).

Inter-structure Mining

Inter-structure mining is concerned with the structure between XML documents. Knowledge is discovered about the relationships between subjects, organizations and nodes on the Web.

Clustering schemas involves identifying similar schemas according to their linguistic and hierarchical closeness. The clusters are used in defining hierarchies of schemas. The schema hierarchy overlaps instances on the web, thus discovering authorities and hubs (Garofalakis, 1999). Creators of schemas are identified as authorities, and creators of instances as hubs. Additional mining techniques are required to identify all instances of schemas present on the web.
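Returning to the association rule task described under intra-structure mining, the <craft> and <driver> rule can be reproduced on a toy scale. The sketch below assumes Python's standard library and simplifies the path-based pseudo-transactions of the text: each document's set of tag names is treated as one transaction. The six documents are invented for illustration and were chosen so that the rule holds with the 80% figure used in the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy corpus: five documents contain <craft>, four of those also <driver>.
DOCS = [
    "<entry><craft>boat</craft><driver>Lee</driver></entry>",
    "<entry><craft>yacht</craft><driver>Kim</driver></entry>",
    "<entry><craft>canoe</craft><driver>Ann</driver></entry>",
    "<entry><craft>ferry</craft><driver>Raj</driver></entry>",
    "<entry><craft>boat building</craft><trade>carpentry</trade></entry>",
    "<entry><trade>plumbing</trade></entry>",
]

def tag_set(xml_text):
    """Return the set of element names used anywhere in one document."""
    return {el.tag for el in ET.fromstring(xml_text).iter()}

transactions = [tag_set(d) for d in DOCS]

def support(items):
    """Fraction of transactions that contain every tag in `items`."""
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    return support(antecedent | consequent) / support(antecedent)

rule_support = support({"craft", "driver"})
rule_confidence = confidence({"craft"}, {"driver"})
print(f"craft -> driver: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

A <craft> tag that appears without <driver>, as in the fifth document, is then a candidate for the occupation reading rather than the vessel reading.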
Classification can identify the most likely places to mine for instances when it is applied with namespaces and URIs (Uniform Resource Identifiers). Having previously associated a set of schemas with a particular namespace or URI, this information is used to classify new XML documents linked with this URI.

XML Content Mining

Content is the text between each start and end tag in XML documents (Yergeau et al., 2004). Mining for XML content is essentially mining for values (an instance of a relation). The semi-structured nature of XML poses a challenge for content mining. XML content mining can be further divided into two tasks: content analysis and structure clarification.

Content Analysis

Data mining of flat text files has been conducted successfully by treating the content of the files as a bag of words or terms. Tasks similar to those performed on other text documents can be performed on XML documents. However, XML represents its data in a hierarchical format, which makes content analysis harder than it is for plain text. One has to consider the granularity and the need for indexing at various levels of abstraction (e.g. whole XML documents vs. parts of an XML document) in mining.

Classification is performed on XML content by labelling new XML content as belonging to a predefined class. A massive search would be required to match the contents of a new XML document with every document in the collection. To reduce the number of comparisons, the schema of the new document is first classified against the pre-existing schemas. Then only the instance classifications of the matching schema need to be considered in classifying the new document.

Clustering on XML content identifies the potential for new classifications. Consideration of schemas leads to a fast clustering process: similar schemas are likely to share a number of value sets. For example, all schemas concerning vehicles will have a set of values representing cars, another set representing boats, etc. However, schemas that appear dissimilar may have similar content.
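The granularity issue raised under content analysis can be illustrated with a bag-of-words built at two levels. This is a minimal sketch assuming Python's standard library; the sample document and the names `document_bag` and `element_bags` are illustrative. The coarse bag treats the document as flat text, while the fine-grained bags keep each term attached to the element path in which it occurs.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

DOC = ("<Book><Title>Introduction of XML</Title>"
       "<Publisher>McGraw-Hills</Publisher></Book>")

def tokens(text):
    """Lower-case alphabetic tokens of a text (empty list for None)."""
    return re.findall(r"[a-z]+", (text or "").lower())

root = ET.fromstring(DOC)

# Coarse granularity: one bag of words for the whole document.
document_bag = Counter(tokens(" ".join(root.itertext())))

# Fine granularity: one bag per element path, so each term stays tied to its tag.
element_bags = {}

def walk(element, prefix=""):
    path = f"{prefix}/{element.tag}"
    words = tokens(element.text)
    if words:
        element_bags.setdefault(path, Counter()).update(words)
    for child in element:
        walk(child, path)

walk(root)
print(document_bag)
print(element_bags)
```

Indexing by path is what allows a term such as "xml" in a title to be weighted differently from the same term in, say, a publisher name.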
Mining XML content inherits some of the problems faced in text mining and analysis. Synonymy and polysemy can cause difficulties, but the tags surrounding the content can usually resolve ambiguities.

Structure Clarification

Content provides support for alternate clustering of similar schemas. Content may prove important in clustering schemas that appear different but have instances with similar content. Due to heterogeneity, the occurrence of synonyms increases. Mining these schemas answers questions such as: are separate schemas actually describing the same thing, only with different terms? While thesauruses are vital, it is impossible for them to be exhaustive for the English language, let alone for all languages.

Conversely, schemas provide support for alternate clustering of content. Two XML documents with distinct content may be clustered together given that their schemas are similar. Schemas that appear similar may actually be completely different, given homographs. For example, consider <craft>boat building</craft> and <craft>boat</craft>. The interpretation of the former is occupation and of the latter vessel. The similarity of content does not distinguish the semantic intention of the tags. Mining in this case provides probabilities of a tag having a particular meaning, or a relationship between meaning and a document.

XML MINING: PROCESS

The XML mining process combines pre-processing, pattern discovery and post-processing. Pre-processing the XML data infers relevant XML structures and contents from the specific resources. For pattern discovery, the application of classification, clustering and association mining techniques to the pre-processed data identifies interesting information. Lastly, the mined patterns are validated and interpreted in the post-processing phase.

Pre-processing: Inferring XML Structure

The main goal of pre-processing is to infer structures from XML documents so that a DM technique can identify interesting patterns. The output of this process is usually a tree or graph representation that captures the structure of the document or schema. It is not mandatory for an XML document to have a schema that defines its data and structure. A schema describes the grammar of an XML document and allows the document to be parsed. XML documents are classified as ill-formed, well-formed or valid according to their structure. Based on this classification, there are two cases of inferring structure: from well-formed or valid documents, and from ill-formed XML documents.

Inferring Structure from Well-Formed or Valid XML Documents

Given the schemas attached to well-formed or valid documents, the structure of these documents can easily be inferred by traversing the document. The inferred structure can be represented in tree format, or a relational representation of the data can be created: the structure is presented as a table with relational attributes that contain the embedded data. If the hierarchy of the attributes is deeper, database techniques such as the addition of more relations and foreign keys and/or normalization methods can be used to accommodate the structure and the data.

The structure can be inferred most easily from valid XML documents. For a well-formed XML document, it is necessary to check the validity of the document with respect to its associated schema, in case an inappropriate schema has been defined. A variety of XML tools, known as validating parsers, have been developed to verify the conformity of well-formed XML documents with their schemas.
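The relational representation described above can be sketched for the BookStore example. The code assumes Python's standard `xml.etree.ElementTree` and a hypothetical two-table layout: one relation for books and a second relation for the nested, repeatable <Author> element, linked by a generated `book_id` key. The column names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

DOC = """<BookStore>
  <Book><Title>Introduction of XML</Title>
        <Author><fname>Smith</fname><lname>Andrew</lname></Author>
        <Publisher>McGraw-Hills</Publisher></Book>
</BookStore>"""

root = ET.fromstring(DOC)
book_rows, author_rows = [], []

for book_id, book in enumerate(root.findall("Book"), start=1):
    # Flat child elements become attributes of the book relation.
    book_rows.append({
        "book_id": book_id,
        "title": book.findtext("Title"),
        "publisher": book.findtext("Publisher"),
    })
    # Nested, repeatable <Author> elements go to a second relation
    # that points back to the book through a foreign key.
    for author in book.findall("Author"):
        author_rows.append({
            "book_id": book_id,
            "fname": author.findtext("fname"),
            "lname": author.findtext("lname"),
        })

print(book_rows)
print(author_rows)
```

Deeper nesting would be handled the same way, with one additional relation and foreign key per repeatable level.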
Moreover, well-formed documents may not always have an accompanying schema, since the presence of a schema is not mandatory. Schema extraction tools are able to generate schemas from the semantic structure of these documents. DTD Generator is a commonly used tool that generates the DTD for a given XML document (Kay, 2000). It identifies a DTD for every XML document; hence a separate set of rules is defined for each XML document in a collection rather than an overall set of rules for the collection. Tools such as XTRACT (Garofalakis, 2000) and DTD-Miner (Moh, 2000) infer an accurate and semantically meaningful DTD schema for a given collection of XML documents. These tools require a relatively homogeneous collection of XML documents. In an environment as heterogeneous and flexible as the Web, it is unreasonable to assume that XML documents related to the same topic have a similar document structure.

Due to the limitations of DTDs as an internal structure, many researchers propose the extraction of XSD (Feng, 2002; Vianu, 2001). XSD is also not obligatory in XML documents; hence structure information must be extracted from XML documents to create the XML Schema. An XML schema extraction algorithm based on Extended Context-Free Grammars (ECFG) with a range of regular expressions has been proposed (Nestorov, 1999). A semantic network-based design has also been presented to convey the semantics carried by XML hierarchical data structures and to transform the model into an XML Schema, thus increasing user understanding of the documents' semantic structure and content as well as the relationships within them (Feng, 2002).

Inferring Structure from Ill-Formed XML Documents

In practice, XML documents often have no schema and no fixed or rigid structure. A schema for such ill-formed XML documents can be inferred by applying the structure extraction approaches developed for semi-structured documents, but not all techniques can effectively infer the structure required by subsequent DM algorithms: they do not include the necessary granularity, the various levels of abstraction and the nesting of tags. For instance, the NoDoSe tool (Adelberg, 1998) is primarily used for determining the structure of semi-structured documents, and it does not support hierarchy as in XML. The extraction algorithms proposed by Myaeng (1998) consider both structure and content in semi-structured documents; however, their purpose is to query and build an index. They are difficult to use and must be altered and adapted prior to the application of data mining algorithms.

For extraction of structure from an ill-formed XML document, the Object Exchange Model (OEM) data and its corresponding data graph can produce the most specific (accurate and concise) data guide/schema (Nestorov, 1999; Wang, 2000; Nayak et al., 2002). The TreeSketch and XSketch methods facilitate query processing by extracting structural summaries (Polyzotis et al., 2004). In summary, these methods rely on a generic graph-summarization model, which captures the basic structure of XML documents, augmented with appropriate distribution information at different levels of granularity. Such methods are more widely applicable than DTD/XSD, since most XML documents have no schema and, if they do, may not conform to it. Some semi-structured data are the result of queries. In such cases it is possible to derive the structure from the query that generated the data, and doing so is a better choice than extracting the schema from the data.

Pre-processing: Inferring XML Content

To discover knowledge in XML documents, it is necessary to query XML tags and content. Several query languages are available, either designed specifically for XML or intended for semi-structured data in general.

Query Languages for Semi-structured Data

XML represents a subset of semi-structured data. Semi-structured data is described by the grammar of ssd-expressions (semi-structured data expressions).
The translation of XML to an ssd-expression is easily automated (Abiteboul et al., 2000). Figure 5 shows an XML description of a person object and an equivalent ssd-expression. Query languages for semi-structured data exploit path expressions, so that data can be queried to a variable depth. Path expressions are elementary queries that return their results as a set of nodes. However, results must be returned as semi-structured data, and path expressions alone cannot do this. Combining path expressions with SQL-style syntax provides greater flexibility in testing for equality, performing joins and specifying the form of query results. Two such languages are Lorel, the query language of the Lightweight Object Repository (Lore) (Abiteboul, 1997), and the Unstructured Query Language (UnQL) (Fernandez, 2000). Lorel takes an object-oriented approach and minimizes dependence on predetermined schema information. UnQL relies more on path expressions and requires greater precision. Figure 6 shows a query in both Lorel and UnQL that specifies the name of a new node and performs an equality test on the name.

XML:

    <person> <name> Kym </name> <age> 25 </age> </person>

ssd-expression:

    { person : { name : "Kym", age : 25 } }

Figure 5: An example of XML and ssd-expression

Lorel:

    Select newnode: X
    From person.age X
    Where person.name = "Kym"

UnQL:

    Select newnode: X
    Where { person: {name: Y, age: X} } in db,
          Y = "Kym"

Result:

    { newnode: 25 }

Figure 6: A query written in Lorel and UnQL and its corresponding result
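The automated translation from XML to an ssd-expression, and the evaluation of a simple path expression over the result, can be sketched as follows. The code assumes Python's standard `xml.etree.ElementTree` and models an ssd-expression as nested dictionaries; the helper names `to_ssd` and `follow` are hypothetical. Repeated child tags and attributes are not handled, and leaf values stay as strings, so this is a simplification rather than a full translation.

```python
import xml.etree.ElementTree as ET

def to_ssd(element):
    """Translate an element into {tag: value}; value is stripped text or a nested dict."""
    children = list(element)
    if not children:
        return {element.tag: (element.text or "").strip()}
    value = {}
    for child in children:
        value.update(to_ssd(child))  # assumes child tags are not repeated
    return {element.tag: value}

def follow(ssd, path):
    """Evaluate a dotted path expression such as 'person.age'."""
    node = ssd
    for label in path.split("."):
        node = node[label]
    return node

db = to_ssd(ET.fromstring("<person><name> Kym</name><age> 25 </age></person>"))

# Mirror the query of Figure 6: test the name, then build a new node from the age.
result = None
if follow(db, "person.name") == "Kym":
    result = {"newnode": follow(db, "person.age")}

print(db)
print(result)
```

The equality test and the construction of `newnode` are the two things a bare path expression cannot express, which is the gap that Lorel and UnQL fill with their SQL-style clauses.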

Query Languages for XML

XML-QL, XSLT, XML-GL, YATL and XQuery are designed specifically for querying XML. XML-QL (Garofalakis, 2000) combines regular path expressions, SQL-style query techniques and XML syntax. Extensible Stylesheet Language Transformations (XSLT) is not a query language as such, but is akin to one in its transformation of XML to HTML and its select pattern mechanism for information retrieval. XML-GL (Ceri, 1999) is a graphical language for querying and restructuring XML documents. YATL is intended to capture a large and useful class of data transformations for querying multiple XML data sources; it brings together information from multiple data sources in one query. XQuery (Boag et al.) uses the structure of XML to express queries across several kinds of data, whether physically stored in XML or viewed as XML via middleware. XQuery operates on the abstract, logical structure of an XML document rather than its surface syntax. These queries produce their output in XML and thus allow the transformation of XML data from one schema to another.

XML-QL query:

    Where <person> <name> Kym </name> <age> $a </age> </person> in db
    Construct <newnode> <age> $a </age> </newnode>

XQuery query:

    for $b in doc("db.xml")/db/person
    where $b/name = "Kym"
    return <newnode> <age> { data($b/age) } </age> </newnode>

XSL query:

    <xsl:for-each select="person[name = 'Kym']">
      <newnode> <age> <xsl:value-of select="age"/> </age> </newnode>
    </xsl:for-each>

Result:

    <newnode> <age> 25 </age> </newnode>

Figure 7: A query written in XML-QL, XQuery and XSL

Pattern Discovery: Combining Structure and Content

Many XML data mining techniques mine useful information from both the structure and the content of XML. The techniques can be divided into three areas: clustering, classification and association.

XML Clustering

A myriad of techniques have been developed for finding similarity between documents or schemas. These techniques are used mainly in data/schema integration or query approximation.
These techniques also facilitate the clustering process. They do so by considering the XML semantic information (linguistic and context elements) as well as the hierarchical structure. The process usually starts by representing the XML document or schema as a tree. Semantic similarity measures use the acronyms, synonyms, hyponyms and hypernyms of names to compare corresponding elements in each of the trees; they also consider the hierarchical positions of elements in the tree. Sequential pattern mining algorithms (Agrawal & Srikant, 1996) have been used by many researchers (Lee & Park, 2004; Leung et al., 2005; Nayak, 2007) to measure structural similarity. These algorithms represent a tree by a set of paths/sequences. A path is represented by a unique sequence of element nodes following the containment links from the root to a leaf node. The sequential pattern algorithm computes the maximal similar paths between XML documents. The combination of semantic and structural similarity is represented as a similarity matrix. K-means or hierarchical agglomerative clustering algorithms (Jain et al., 1999) then generate clusters of XML documents from this matrix.

Figure 8: A Classification of Similarity Measure Approaches

A classification of these approaches is presented in Figure 8. The structure-level similarity approaches detect and measure three different sets of data: (1) the structural and content similarities between documents (Dalamagas et al., 2004; Flesca et al., 2005; Huang, 1997; Lee et al., 2002; Nayak & Xu, 2006), (2) the structural similarity between documents and schemas (Bertino et al., 2004), and (3) the structural and content similarity between schemas (Nayak, 2007; Nayak & Xia, 2004). The approaches using the first and third sets rely on the notion of tree edit distance developed in combinatorial pattern matching (Zhang & Shasha, 1989). The problem is to compute the minimum distance between two trees T1 and T2, given three common editing operations: changing, deleting, and inserting a node. Each of these operations is assigned a cost that depends on the labels of the nodes involved. The task is to find a sequence of such operations (an edit script) that transforms T1 into T2 with minimum cost. The distance between T1 and T2 is then defined to be the cost of such a sequence. The approaches using the second set rely on measuring the structural similarity between data and schema in the context of XML. Some of these techniques represent documents as edge-labelled graphs, ignoring the constraints on the repeatability of elements, or as element alternatives in XML schemas.
Additionally, some techniques cannot be directly applied to cluster documents without knowledge of their schemas, and dissimilarities among documents referring to the same schema cannot be identified. However, these approaches take into account the context of an element, which strongly contributes to determining what information that element models. The element-level similarity matching approaches, known as schema matching, determine the semantic correspondence between elements of two schemas. These methods use the document schema to cluster XML documents. Relevant schema information is used to efficiently determine the similarity of corresponding elements in XML documents. The document schema provides a definitive description of the document, while document instances represent examples of content. The document definition outlined in a schema holds true for all document instances of that schema; hence schema clustering results hold true for all document instances and can be reused for other instances.
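The tree edit distance used by the structure-level approaches above can be sketched as follows. This is a minimal illustration under stated assumptions: every operation (insert, delete, relabel) has unit cost, only element tags are compared, and the plain recursive forest formulation with memoization is used rather than the optimized dynamic program of Zhang and Shasha (1989). The function names are hypothetical.

```python
from functools import lru_cache
import xml.etree.ElementTree as ET


def to_tuple(elem):
    """Represent an element as a hashable (tag, children) tuple."""
    return (elem.tag, tuple(to_tuple(child) for child in elem))


def size(forest):
    """Number of nodes in an ordered forest (a tuple of trees)."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Minimum number of unit-cost edits turning ordered forest f1 into f2."""
    if not f1 and not f2:
        return 0
    if not f1:
        return size(f2)  # insert every node of f2
    if not f2:
        return size(f1)  # delete every node of f1
    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    # Delete the rightmost root of f1: its children take its place.
    delete = forest_distance(f1[:-1] + kids1, f2) + 1
    # Insert the rightmost root of f2.
    insert = forest_distance(f1, f2[:-1] + kids2) + 1
    # Match the two rightmost roots, relabelling if the tags differ.
    match = (
        forest_distance(kids1, kids2)
        + forest_distance(f1[:-1], f2[:-1])
        + (0 if label1 == label2 else 1)
    )
    return min(delete, insert, match)


def tree_edit_distance(xml_a, xml_b):
    """Edit distance between the element trees of two XML strings."""
    tree_a = to_tuple(ET.fromstring(xml_a))
    tree_b = to_tuple(ET.fromstring(xml_b))
    return forest_distance((tree_a,), (tree_b,))


doc_a = "<person><name/><age/></person>"
doc_b = "<person><name/><email/></person>"
print(tree_edit_distance(doc_a, doc_b))
```

With unit costs a relabel is cheaper than a delete followed by an insert, which reflects the cost assumption commonly made for the tree edit problem. The pairwise distances could be used to fill a similarity matrix for a clustering algorithm; that step is not shown here.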

The main difference between the element-level and structure-level matching approaches is that in the former, similarity determination is based primarily on the elements of the trees, in particular on the similarity of their semantic names and name structures. Structure-level matching, on the other hand, determines the similarity of whole tree structures and ignores detailed elements in the tree. The tree edit problem treats the label of each node in the tree as a secondary consideration. For instance, the cost of relabelling a node is assumed to be lower than that of deleting a node with the old label and inserting a node with the new label. Thus schema matching uses internal tree elements, whereas the tree edit distance approach matches trees at a higher level. The tree edit distance approach addresses only the existence of different elements in two trees, not their cardinality. Researchers have approached schema matching for XML data at three different levels, as shown in Figure 8. Instance-based matchers either use meta-data and statistical data collected from data instances to annotate the schema, or directly correlate schema elements (Kurgan et al., 2002). Instance-only approaches sometimes fail to capture the structure information of the XML data. Machine learning techniques are used to improve accuracy but can be very computationally expensive. Schema-based matchers consider only schema information, not instance data. Schema information includes tag names, descriptions, relationships, constraints, etc. Schema-only approaches can be used for mapping a collection of heterogeneous XML schemas (Do & Rahm, 2002; Jeong & Hsu, 2001; Lee et al., 2002; Madhavan et al., 2001; Melnik et al., 2002; Nayak, 2007; Nayak & Xia, 2004). However, the absence of instance data can result in increased element mismatch.
Therefore the accuracy of the mapping recommended by the schema-only approaches depends on the technique used for linguistic and structure matching. Both the instance-only and the schema-only approaches have difficulty finding similar elements between XML documents. Therefore many researchers have combined instance and schema information for schema matching (Doan et al., 2001). These combined approaches, however, need both the XML documents and their associated schema definitions to be available for the mapping.

XML Association Mining

XML sources are generally represented as an ordered-labelled or unordered-labelled tree. The task is to build up associations among trees (including sub-trees, substructures, sub-graphs and paths) rather than among items as in traditional mining. Frequent substructure (tree) mining extracts substructures (sub-trees, sub-graphs or paths) that occur frequently among a set of XML documents or within an individual XML document. These frequent substructures generate association rules. However, the frequent substructures are hierarchical, and counting support requires more than just the joining of flat sets.

Generation of Frequent Substructures: Let CS = {[C]_1, [C]_2, ..., [C]_d} be a set of initial candidate substructure sets, where d is the depth of the tree. This differs from traditional association mining (AM), in which there is no predefined candidate set; instead, one is generated incrementally by merging elements in the frequent set of the previous round. In this hierarchical structure, a candidate set (CS) already exists. Additionally, in each round, the merging of current candidate sets derives a larger frequent fragment set.
The search space for finding frequent structures is much larger than that for traditional association mining data sets; it therefore requires more effective pruning strategies (to eliminate candidate item-sets from previous rounds) and merging strategies (to combine candidate item-sets in the next round). Researchers have also utilised the mining of closed frequent trees to reduce the number of generated patterns (Kutty, 2007).
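The level-wise idea above, with one candidate set per tree depth and pruning between rounds, can be sketched for the simplest kind of substructure, the root-anchored tag path. This is a simplified stand-in and not any of the published algorithms: support is counted once per document, and a path at depth k is only considered if its prefix at depth k-1 was frequent. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def prefix_paths(root):
    """Return the set of root-anchored tag paths (as tuples) in one document."""
    found = set()

    def walk(elem, prefix):
        path = prefix + (elem.tag,)
        found.add(path)
        for child in elem:
            walk(child, path)

    walk(root, ())
    return found


def frequent_paths(xml_docs, min_support):
    """Mine paths occurring in at least min_support documents, one depth per round."""
    doc_paths = [prefix_paths(ET.fromstring(text)) for text in xml_docs]
    max_depth = max(len(path) for paths in doc_paths for path in paths)
    frequent = {}
    survivors = set()  # frequent paths found in the previous round
    for depth in range(1, max_depth + 1):
        counts = Counter()
        for paths in doc_paths:
            for path in paths:
                # Pruning: a path can only be frequent if its prefix was frequent.
                if len(path) == depth and (depth == 1 or path[:-1] in survivors):
                    counts[path] += 1
        survivors = {path for path, count in counts.items() if count >= min_support}
        if not survivors:
            break
        frequent.update({path: counts[path] for path in survivors})
    return frequent


docs = [
    "<person><name/><age/></person>",
    "<person><name/><email/></person>",
    "<person><name/><age/><address><city/></address></person>",
]
for path, support in sorted(frequent_paths(docs, min_support=2).items()):
    print("/".join(path), support)
```

In this example the path person/address/city is never counted, because person/address fails the support threshold in the previous round. Mining full sub-trees or sub-graphs needs a richer candidate representation and merging step than this path-only sketch provides.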

Recently a number of researchers have developed algorithms able to detect frequently occurring substructures in structural data collections. These include AGM, FSG, TreeMiner and gSpan (Paik et al., 2005; Zaki, 2002). Chi et al. (2005) give a good overview of frequent tree mining. A limitation of these algorithms is that they do not account for the dynamic nature of XML data. To overcome this, Zhao (2007) developed a frequently-changing structures mining technique that considers the changing nature of XML data. It aims to extract structures that change frequently from the sequence of historical XML versions. Both the structure (through inserts and deletes) and the content (through updates) of XML documents can change frequently. It is important to understand such changes across different versions of the same document. Many XML DM techniques employ frequent sets in the process of classifying XML data as well as in the processes of clustering and association rule generation.

Generation of Association Rules: A number of techniques use the expressive power of the query languages to extract association rules (Braga et al., 2002), or rely on the traditional framework with an XML interface (Edmonds, 2005; Kotasek, 2000). This requires user familiarity with the internal structure and content of the document(s). Examples of user input include the XPath expression selecting the parent nodes of the data items to be mined, and XPath expressions relative to that node locating the output and input values (Edmonds, 2005). The XMINE rule operator extracts association rules from XML documents using an SQL-like format (Braga et al., 2002). However, XML data must be mapped to a relational structure before performing association. This requires powerful pre-processing and may result in information loss during conversion. Wan (2004) used XQuery expressions to extract association rules from XML data and to calculate support and confidence.
This technique is limited in that it fails to account for the structure of the XML data. For more complex XML data, transformation may be required before applying the XQuery expressions. XAR-Miner transforms a small XML document into an indexed XML tree (IX-tree, with bidirectional linking between parent and child nodes) and transforms a large XML document into multi-relational databases (Zhang, 2004). XPaths for each relational database are created during data transformation, maintaining the hierarchical information in the original XML document. A set of paths between the instances of related concepts is extracted from either the IX-tree or the relational database for association rule mining. These paths (known as meta-patterns) are then generalized, eliminating any unnecessary meta-patterns to maximize the significance of the association rules. Based on these meta-patterns, XAR-Miner generalizes the raw XML data and generates association rules according to the user's needs using the Apriori algorithm. The generation of association rules from frequent hierarchical trees remains an unsolved problem.

XML Classification Mining

The classification task is applicable to a wide variety of problems in XML; however, it has not been studied well. Classification of XML documents requires the identification of structural rules. In the training phase, a set of structural classification rules is built; these rules can then be used in the testing phase to classify data of unknown class. The efficiency of existing XML document classification algorithms is limited by their inability to explore the structural information. A few researchers have developed generic classifiers (e.g. information retrieval (IR) based and association based) as well as specific classifiers (e.g. rule based according to structures) for XML. The IR-based methods treat each document as a bag of words. These methods use the actual text of the XML data but not the structural information inside the documents. The association-based methods use the associations among different nodes visited in a session in order to perform the classification. An effective rule-based classifier for XML is XRules (Zaki & Aggarwal, 2003), a method that uses a set of structural rules for XML document classification. XMiner (Zaki, 2002) uses frequent sub-trees in a collection of XML trees to mine a set of rules. In the training phase, it produces a set of structural classification rules that can be used in the testing phase to classify data of unknown class. XRules has been shown to provide better XML classifiers than both the IR-based and association-based classifiers. Theobald (2003) explores the structure, annotation and ontological knowledge in XML data to facilitate automatic classification of XML data. The approach uses the support vector machine (SVM) technique in the training phase, in which a set of tags (element names) and text terms are used as features. This technique computes separating boundaries (known as hyperplanes) between feature space objects from different classes. These hyperplanes can then be used to classify unseen data in the testing phase. This technique is based on the assumption that tags are more important than text terms in exploiting the structural and ontological information in XML documents. Edmonds (2005) uses the traditional framework with an XML interface to pre-process the data for training. It performs a statistical analysis of the pre-processed data and then creates a fuzzy decision tree before converting the result into Metarule format. The mapping of the XML data into a relational structure may result in information loss and also requires additional processing.

Post-processing: Interpreting mined patterns

Post-processing for the discovery of useful knowledge involves the analysis and assimilation of the generated XML pattern models. Due to the variety of tool-specific parameters, the resultant model and its performance must be properly interpreted.
The mining model should be visualized in a user-friendly fashion. In addition, the generated prediction model should be able to classify unseen values using the user's tool. Extensive ongoing research into the post-processing phase of XML mining aims to improve the usability of data models. The following section outlines the evolution of interpretation approaches.

Conventional Approaches

In conventional approaches, data models generated by a mining algorithm are treated differently depending on the application or mining tool being used. Such tools include OLAP (OnLine Analytical Processing), relational databases and other data mining specific tools. With these tools, problems occur when complex XML mining implementations involve different XML-enabled databases and different application vendors, such as IBM, Oracle or Microsoft. Each tool has its own post-processing module that it uses to communicate the obtained result. In traditional mining techniques, this limitation exists regardless of the documents or area mined. In other words, it is difficult to share data models obtained from multiple sources. It is necessary to deal with differences between applications and tools in order to share patterns generated by the mining process. However, recent developments can output XML patterns in a format that allows simpler and more flexible data mining applications.

Current Approaches

Recently, XML-based markup languages that describe the data mining process have been employed as part of data mining post-processing. Discovered patterns can thus be interchanged among conforming data mining and analytical applications. The integrated data mining tools have tremendous potential for expanding the interoperability of XML documents.
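The idea that a discovered pattern can be written out as XML and read back by another tool can be sketched as follows. This is only an illustration: the element and attribute names (`rule`, `antecedent`, `consequent`, `item`, `support`, `confidence`) and the sample order data are hypothetical and do not follow PMML or any other published schema. The sketch also shows how support and confidence are computed for a single association rule.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: each <order> is one transaction.
ORDERS_XML = (
    "<orders>"
    "<order><item>bread</item><item>milk</item></order>"
    "<order><item>bread</item><item>milk</item><item>eggs</item></order>"
    "<order><item>bread</item></order>"
    "<order><item>eggs</item></order>"
    "</orders>"
)


def transactions_from_xml(xml_text):
    """Read each <order> element as a set of item names."""
    root = ET.fromstring(xml_text)
    return [{item.text for item in order.findall("item")} for order in root.findall("order")]


def rule_stats(transactions, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent."""
    with_antecedent = sum(1 for t in transactions if antecedent <= t)
    with_both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    support = with_both / len(transactions)
    confidence = with_both / with_antecedent if with_antecedent else 0.0
    return support, confidence


def rule_to_xml(antecedent, consequent, support, confidence):
    """Serialize one rule using an ad hoc XML layout (not PMML)."""
    rule = ET.Element("rule", support=f"{support:.2f}", confidence=f"{confidence:.2f}")
    for tag, items in (("antecedent", antecedent), ("consequent", consequent)):
        side = ET.SubElement(rule, tag)
        for name in sorted(items):
            ET.SubElement(side, "item").text = name
    return ET.tostring(rule, encoding="unicode")


def rule_from_xml(xml_text):
    """Read a rule back, as a second tool consuming the exchanged pattern might."""
    rule = ET.fromstring(xml_text)
    antecedent = {item.text for item in rule.find("antecedent")}
    consequent = {item.text for item in rule.find("consequent")}
    return antecedent, consequent, float(rule.get("support")), float(rule.get("confidence"))


transactions = transactions_from_xml(ORDERS_XML)
support, confidence = rule_stats(transactions, {"bread"}, {"milk"})
exchanged = rule_to_xml({"bread"}, {"milk"}, support, confidence)
print(exchanged)
print(rule_from_xml(exchanged))
```

A shared, agreed schema is what makes this round trip useful across vendors; the ad hoc layout here only works because the writer and the reader were written together.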

Recent developments include (1) Predictive Model Markup Language (PMML) (Wettschereck, 2001), which uses XML to specify several kinds of data mining models, and (2) Log Markup Language, which uses XML to structurally express the contents of Web server log files (Punin et al., 2001). These facilitate the integration and analysis of data collected from various web server log files and allow a better understanding of the user's web site. PMML, introduced by the Data Mining Group (DMG), describes the structure and content of data mining models in XML. A set of DTDs included in PMML supports several types of data mining models (Wettschereck, 2001). After a discovered XML pattern model is generated by a data mining algorithm, it is stored in the PMML format, which allows model interchange. By implementing PMML, XML documents from multiple sources can be mined without consideration of the differences between those sources and the various applications used. XDMSL (XML Data Mining Specification Language) extends the markup language approach to the whole process of knowledge discovery, including the source data model, data transformations, prior domain knowledge, the data mining task description and the knowledge discovered by the data mining task (Kotásek & Zendulka, 2002). Many applications have not yet standardized on the XDMSL approach. To address this issue, XDMQL (XML Data Mining Query Language) is likely to be used for data exchange between different data mining system components as part of the XDMSL implementation. These two languages are platform-independent, extensible and robust, and are thus able to support information exchange in heterogeneous and modular environments.

COMMERCIAL USE OF XML IN DATA MINING

One of XML's major advantages is its ability to manage the variety of data sources, types and structures that businesses transfer over the Internet.
Despite some differences between XML data and the typical historical relational data associated with data mining, there is a driving force towards using the Internet as a medium for analytical data. XML itself is effective in transmitting and sharing data over the Internet. Companies want to extend this advantage to analytical data as well. Using XML data in the mining process is quite an innovation and is made possible by new web-based technologies.

XML for Analysis

Based on these ideas, XML for Analysis was developed by Microsoft and Hyperion Solutions Corporation in April. The specification defines a communication structure for an application programming interface (API) and aims to keep client programming independent of the mechanics of data transport while ensuring adequate information regarding the data and proper handling of it. It is independent of platform, programming language and data source.

Simple Object Access Protocol (SOAP)

Another technology enabling XML use in data mining is SOAP, developed by Microsoft, IBM and Iona. SOAP standardises data access interaction between client applications and analytical providers (data mining and On Line Analytical Processing) over the Internet. SOAP services can be described using WSDL (Web Services Description Language), which is the IDL (Interface Definition Language) for web services. WSDL is independent of SOAP, but it is needed to explain which SOAP messages can be exchanged. How a service is discovered is addressed below. Using the SOAP protocol, a server can retrieve information from a client across the web. In doing this, the server side (1) sends several SOAP requests, (2) processes the requests that it receives, (3) finds different patterns, and (4) creates profiles based on appropriate limitations or performs appropriate analyses. Ease of use and the platform independence of this protocol are other important factors.

Explanation of processes of discovery

Universal Description Discovery and Integration (UDDI), developed by Microsoft, IBM and Ariba, uses the XML Schema Language to formally describe its data structures. UDDI is SOAP based and defines global interaction with the web service information repository. A web service is a self-describing, self-contained, modular unit of application logic that provides business functionality to other applications through an Internet connection. The UDDI specification enables businesses to quickly, easily and dynamically find each other and interact. It enables a business to describe itself as well as to find and interact with businesses offering desired services. This Internet-facilitated discovery and interaction fosters new e-business partnerships. UDDI also simplifies the integration of disparate systems and allows market expansion, improved efficiency and reduced cost. Applications can access web services via ubiquitous web protocols and data formats, such as XML, without concern for how the web service is implemented. Web services can be mixed and matched to execute a larger workflow or business transaction. The UDDI Business Registry can be accessed using SOAP, and a service registered in the UDDI Business Registry can expose any type of service interface.

vtag Web Mining Server

A product that supports SOAP, WSDL and UDDI is vtag Web Mining Server. This product aims to monitor and mine the web and includes features (Connotate Technologies: vtag, 200) such as:
Automatic extraction from HTML, PDF, spreadsheet, and other file formats and conversion to XML.
Unlimited 'Information Agents' provide continuous monitoring, extraction and alerting.
Scripting, password access, automated parameter entry, and multi-page aggregation.
Seamless integration with other applications via Web Services, database delivery, and API programming interface.
Agent Repository filters extract and deliver information, while instant Web Services create the web services. The information agents are accessed by SOAP, and instant Web Services automatically generates WSDL and UDDI (Connotate Technologies: vtag, 200).

Comments

The combination of XML and data mining is possible with SOAP, since this protocol enables data interaction on the web and therefore data collection. SOAP works optimally in collaboration with WSDL and UDDI. Some efforts have been made to implement these protocols, but the full potential of these technologies has not yet been realised. There is much research in this area and new products are expected. IBM and Microsoft are developing database solutions (Xperanto from IBM and Yukon from Microsoft) that will support both data mining and XML. Since HTTP, XML and SOAP are platform independent, issues associated with competing proprietary protocols should be resolved.

CONCLUSION AND FUTURE DIRECTIONS

With the growing importance of XML in document representation, new processing and integration technologies are being devised. The focus of this chapter, however, has been to describe, in general, the capability and benefits of data mining techniques in the XML domain and to conceptualize the XML mining process. This chapter attempts to show how the utilisation of data mining techniques improves knowledge discovery from both the structure and the content of XML documents.

This chapter explicitly describes the representation of XML data and the broad categories of XML mining: XML structure mining and XML content mining. These categories are presented according to data mining tasks such as classification, clustering and association. The chapter then presents the process of knowledge discovery from XML documents, summarising the three tasks of clustering, association mining and classification on the structure and/or content of XML documents. The chapter further discusses the evolution of knowledge discovery, where the current application of XML enables a simplified data mining process and makes the discovered patterns interchangeable among conforming data mining tools and other analytical applications. The chapter then introduces the protocols that support XML and data mining, making data mining possible across the web using XML. XML data mining is a challenging and exciting field with further possibilities. The following are some of the areas identified for future development:

Integration of XML Mining

The integration of XML, database languages such as SQL, and data mining techniques will increase the functionality of relational database products and XML products. It will provide more user-friendly mining. The larger RDBMS and data warehouse companies have already expressed an interest in integrating data mining and XML data models into their database products.

Graphical user interface

Full integration of data mining products with other application tools and the use of GUIs will enhance usability. To satisfy the range of data mining users (from naive to expert users), future work should include mining user graphs, that is, structural information about web usage, as well as visualization of mined data using systems such as the WWWPal system (WWWPAL).

Multimedia XML data

To perform web content mining, keyword information and content for each of the nodes is required.
This information will allow the automatic development of a set of keywords to distinguish text documents, multimedia documents and other kinds of documents based on contained characteristics such as color, brightness and texture. Data mining is able to intelligently prepare data and allow types of information to be distinguished.

Security and Privacy

As data mining is applied to large semantic documents or XML documents, the extraction of information should consider privacy and rights management of shared data. XML mining should enforce authorization levels so that only appropriate users can discover classified information.

REFERENCES

Abiteboul, S., Buneman, P., & Suciu, D. (2000). Data on the Web: From Relations to Semistructured Data and XML. California: Morgan Kaufmann.
Abiteboul, S., Quass, D., McHugh, J., Widom, J., & Weiner, J. (1997). The Lorel Query Language for Semistructured Data. Journal of Digital Libraries, 1(1),
Adelberg, B. (1998). NoDoSE: A tool for Semi-Automatically Extracting Structured and Semistructured Data from Text Documents. Paper presented at the Proceedings of the ACM SIGMOD Conference on Management of Data, Seattle, USA.
Agrawal, R., & Srikant, R. (1996). Mining Sequential Patterns: Generalizations and Performance Improvements. Paper presented at the 5th International Conference on Extending Database Technology (EDBT'96), France.
Bertino, E., Guerrini, G., & Mesiti, M. (2004). A Matching Algorithm for Measuring the Structural Similarity between an XML Document and a DTD and its applications. Information Systems, 29(1),
Boag, S., Chamberlin, D., Fernández, M., Florescu, D., et al. XQuery 1.0: An XML Query Language. Retrieved September, 2005, from
Braga, D., Campi, A., Ceri, S., Klemettinen, M., et al. (2002). A Tool for Extracting XML Association Rules. Paper presented at the Proceedings of the 14th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'02), USA.
Ceri, S., Comai, S., Damiani, E., Fraternali, P., Paraboschi, S., & Tanca, L. (1999). XML-GL: A Graphical Language for Querying and Restructuring XML Documents. Paper presented at the Proc. 8th International WWW Conference, Toronto, Canada.

Chi, Y., Nijssen, S., & Muntz, R. (2005). Frequent Subtree Mining - An Overview. Fundamenta Informaticae Special Issue on Graph and Tree Mining, 66(1-2),
Connotate Technologies: vtag. (200). 2006, from
Dalamagas, T., Cheng, T., Winkel, K., & Sellis, T. K. (2004). Clustering XML documents by Structure. Paper presented at the SETN.
Do, H. H., & Rahm, E. (2002). COMA - A System for Flexible Combination of Schema Matching Approaches. Paper presented at the 28th VLDB, Hong Kong, China.
Doan, A., Domingos, R., & Halevy, A. Y. (2001). Reconciling schemas of disparate sources: a machine-learning approach. Paper presented at the ACM SIGMOD, Santa Barbara, California, United States.
Edmonds, A. (2005). XML Miner & Metarule White Paper. Retrieved January 14, 2005, from
Fayyad, U. M., Piatetsky-Shapiro, G., & Smyth, P. (1995). From Data Mining to Knowledge Discovery: An Overview. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, & R. Uthurusamy (Eds.), Advances in Knowledge Discovery and Data Mining (pp. 1-34): AAAI Press.
Feng, L., Chang, E., & Dillon, T. (2002). A Semantic Network-Based Design Methodology for XML Documents. ACM Transactions on Information Systems (TOIS), 20(4),
Fernandez, M., Buneman, P., & Suciu, D. (2000). UNQL: A Query Language and Algebra for Semistructured Data based on Structural Recursion. VLDB Journal: Very Large Data Bases, 9(1),
Flesca, S., Manco, G., Masciari, E., Pontieri, L., et al. (2005). Fast Detection of XML Structural Similarities. IEEE Transaction on Knowledge and Data Engineering, 7(2),
Garofalakis, M., Rastogi, R., Seshadri, S., & Shim, K. (1999). Data Mining and the Web: Past, Present and Future. Paper presented at the second international workshop on web information and data management, Kansas City, USA.
Garofalakis, M. N., Gionis, A., Rastogi, R., Seshadri, S., & Shim, K. (2000). XTRACT: A System for Extracting Document Type Descriptors from XML Documents.
Paper presented at the Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Texas, USA.
Guardalben, G. (2004). Integrating XML and Relational Database Technologies: A Position Paper. Retrieved May 1st, 2005, from er.pdf
Han, J., & Kamber, M. (2001). Data Mining: Concepts and Techniques. San Diego, USA: Morgan Kaufmann.
Huang, Z. (1997). A fast clustering algorithm to cluster very large categorical data sets in data mining. Paper presented at the SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery.
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data Clustering: A Review. ACM Computing Surveys (CSUR), 31(3),
Jeong, E., & Hsu, C.-N. (2001). Induction of integrated view for XML data with heterogeneous DTDs. Paper presented at the 10th International Conference on Information and Knowledge Management, Atlanta, Georgia, USA.
Kay, M. (2000). SAXON DTD Generator - A Tool to Generate XML DTDs, January, 2006, from
Kotasek, P., & Zendulka, J. (2000). An XML Framework Proposal for Knowledge Discovery in Database. Paper presented at the Fourth European Conference on Principles and Practice of Knowledge Discovery in Databases, Workshop Proceedings Knowledge Management: Theory and Applications, Lyon, France.
Kotásek, P., & Zendulka, J. (2002). Describing the Data Mining Process with DMSL. Paper presented at the ADBIS
Kurgan, L., Swiercz, W., & Cios, K. (2002). Semantic Mapping of XML Tags using Inductive Machine Learning. Paper presented at the International Conference on Machine Learning and Applications 2002 (ICMLA).
Kutty, S., Nayak, R., & Li, Y. (2007). PCITMiner - Prefix-based Closed Induced Tree Miner for finding closed induced frequent subtrees. Paper presented at the Sixth Australasian Data Mining Conference (AusDM 2007), Gold Coast, Australia.
Lee, J. W., & Park, S. S. (2004). Finding Maximal Similar Paths Between XML Documents Using Sequential Patterns. Paper presented at the ADVIS, Izmir, Turkey.
Lee, L. M., Yang, L. H., Hsu, W., & Yang, X. (2002). XClust: Clustering XML Schemas for Effective Integration. Paper presented at the 11th ACM International Conference on Information and Knowledge Management (CIKM'02), Virginia.
Leung, H.-p., Chung, F.-l., & Chan, S. C.-f. (2005). On the use of hierarchical information in sequential mining-based XML document similarity computation. Knowledge and Information Systems, 7(4),


More information

XML DATA INTEGRATION SYSTEM

XML DATA INTEGRATION SYSTEM XML DATA INTEGRATION SYSTEM Abdelsalam Almarimi The Higher Institute of Electronics Engineering Baniwalid, Libya Belgasem_2000@Yahoo.com ABSRACT This paper describes a proposal for a system for XML data

More information

A MEDIATION LAYER FOR HETEROGENEOUS XML SCHEMAS

A MEDIATION LAYER FOR HETEROGENEOUS XML SCHEMAS A MEDIATION LAYER FOR HETEROGENEOUS XML SCHEMAS Abdelsalam Almarimi 1, Jaroslav Pokorny 2 Abstract This paper describes an approach for mediation of heterogeneous XML schemas. Such an approach is proposed

More information

Agents and Web Services

Agents and Web Services Agents and Web Services ------SENG609.22 Tutorial 1 Dong Liu Abstract: The basics of web services are reviewed in this tutorial. Agents are compared to web services in many aspects, and the impacts of

More information

Modern Databases. Database Systems Lecture 18 Natasha Alechina

Modern Databases. Database Systems Lecture 18 Natasha Alechina Modern Databases Database Systems Lecture 18 Natasha Alechina In This Lecture Distributed DBs Web-based DBs Object Oriented DBs Semistructured Data and XML Multimedia DBs For more information Connolly

More information

Data Mining for Web-Enabled Electronic Business Applications. Richi Nayak

Data Mining for Web-Enabled Electronic Business Applications. Richi Nayak Data Mining for Web-Enabled Electronic Business Applications Richi Nayak School of Information Systems Queensland University of Technology Brisbane QLD 4001, Australia r.nayak@qut.edu.au ABSTRACT Web-Enabled

More information

XML Processing and Web Services. Chapter 17

XML Processing and Web Services. Chapter 17 XML Processing and Web Services Chapter 17 Textbook to be published by Pearson Ed 2015 in early Pearson 2014 Fundamentals of http://www.funwebdev.com Web Development Objectives 1 XML Overview 2 XML Processing

More information

Application of XML Tools for Enterprise-Wide RBAC Implementation Tasks

Application of XML Tools for Enterprise-Wide RBAC Implementation Tasks Application of XML Tools for Enterprise-Wide RBAC Implementation Tasks Ramaswamy Chandramouli National Institute of Standards and Technology Gaithersburg, MD 20899,USA 001-301-975-5013 chandramouli@nist.gov

More information

The basic data mining algorithms introduced may be enhanced in a number of ways.

The basic data mining algorithms introduced may be enhanced in a number of ways. DATA MINING TECHNOLOGIES AND IMPLEMENTATIONS The basic data mining algorithms introduced may be enhanced in a number of ways. Data mining algorithms have traditionally assumed data is memory resident,

More information

XML and Data Management

XML and Data Management XML and Data Management XML standards XML DTD, XML Schema DOM, SAX, XPath XSL XQuery,... Databases and Information Systems 1 - WS 2005 / 06 - Prof. Dr. Stefan Böttcher XML / 1 Overview of internet technologies

More information

Semistructured data and XML. Institutt for Informatikk INF3100 09.04.2013 Ahmet Soylu

Semistructured data and XML. Institutt for Informatikk INF3100 09.04.2013 Ahmet Soylu Semistructured data and XML Institutt for Informatikk 1 Unstructured, Structured and Semistructured data Unstructured data e.g., text documents Structured data: data with a rigid and fixed data format

More information

An Approach to Eliminate Semantic Heterogenity Using Ontologies in Enterprise Data Integeration

An Approach to Eliminate Semantic Heterogenity Using Ontologies in Enterprise Data Integeration Proceedings of Student-Faculty Research Day, CSIS, Pace University, May 3 rd, 2013 An Approach to Eliminate Semantic Heterogenity Using Ontologies in Enterprise Data Integeration Srinivasan Shanmugam and

More information

04 XML Schemas. Software Technology 2. MSc in Communication Sciences 2009-10 Program in Technologies for Human Communication Davide Eynard

04 XML Schemas. Software Technology 2. MSc in Communication Sciences 2009-10 Program in Technologies for Human Communication Davide Eynard MSc in Communication Sciences 2009-10 Program in Technologies for Human Communication Davide Eynard Software Technology 2 04 XML Schemas 2 XML: recap and evaluation During last lesson we saw the basics

More information

Translating between XML and Relational Databases using XML Schema and Automed

Translating between XML and Relational Databases using XML Schema and Automed Imperial College of Science, Technology and Medicine (University of London) Department of Computing Translating between XML and Relational Databases using XML Schema and Automed Andrew Charles Smith acs203

More information

Data Integration for XML based on Semantic Knowledge

Data Integration for XML based on Semantic Knowledge Data Integration for XML based on Semantic Knowledge Kamsuriah Ahmad a, Ali Mamat b, Hamidah Ibrahim c and Shahrul Azman Mohd Noah d a,d Fakulti Teknologi dan Sains Maklumat, Universiti Kebangsaan Malaysia,

More information

A Mind Map Based Framework for Automated Software Log File Analysis

A Mind Map Based Framework for Automated Software Log File Analysis 2011 International Conference on Software and Computer Applications IPCSIT vol.9 (2011) (2011) IACSIT Press, Singapore A Mind Map Based Framework for Automated Software Log File Analysis Dileepa Jayathilake

More information

Dr. Anuradha et al. / International Journal on Computer Science and Engineering (IJCSE)

Dr. Anuradha et al. / International Journal on Computer Science and Engineering (IJCSE) HIDDEN WEB EXTRACTOR DYNAMIC WAY TO UNCOVER THE DEEP WEB DR. ANURADHA YMCA,CSE, YMCA University Faridabad, Haryana 121006,India anuangra@yahoo.com http://www.ymcaust.ac.in BABITA AHUJA MRCE, IT, MDU University

More information

Search Result Optimization using Annotators

Search Result Optimization using Annotators Search Result Optimization using Annotators Vishal A. Kamble 1, Amit B. Chougule 2 1 Department of Computer Science and Engineering, D Y Patil College of engineering, Kolhapur, Maharashtra, India 2 Professor,

More information

Big Data Text Mining and Visualization. Anton Heijs

Big Data Text Mining and Visualization. Anton Heijs Copyright 2007 by Treparel Information Solutions BV. This report nor any part of it may be copied, circulated, quoted without prior written approval from Treparel7 Treparel Information Solutions BV Delftechpark

More information

DTD Tutorial. About the tutorial. Tutorial

DTD Tutorial. About the tutorial. Tutorial About the tutorial Tutorial Simply Easy Learning 2 About the tutorial DTD Tutorial XML Document Type Declaration commonly known as DTD is a way to describe precisely the XML language. DTDs check the validity

More information

Databases in Organizations

Databases in Organizations The following is an excerpt from a draft chapter of a new enterprise architecture text book that is currently under development entitled Enterprise Architecture: Principles and Practice by Brian Cameron

More information

Data Mining Governance for Service Oriented Architecture

Data Mining Governance for Service Oriented Architecture Data Mining Governance for Service Oriented Architecture Ali Beklen Software Group IBM Turkey Istanbul, TURKEY alibek@tr.ibm.com Turgay Tugay Bilgin Dept. of Computer Engineering Maltepe University Istanbul,

More information

AN ENHANCED DATA MODEL AND QUERY ALGEBRA FOR PARTIALLY STRUCTURED XML DATABASE

AN ENHANCED DATA MODEL AND QUERY ALGEBRA FOR PARTIALLY STRUCTURED XML DATABASE THE UNIVERSITY OF SHEFFIELD DEPARTMENT OF COMPUTER SCIENCE RESEARCH MEMORANDA CS-03-08 MPHIL/PHD UPGRADE REPORT AN ENHANCED DATA MODEL AND QUERY ALGEBRA FOR PARTIALLY STRUCTURED XML DATABASE SUPERVISORS:

More information

An XML Schema Extension for Structural Information in Internet and Web-Based Systems

An XML Schema Extension for Structural Information in Internet and Web-Based Systems An XML Schema Extension for Structural Information in Internet and Web-Based Systems Jacky C.K. Ma and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong Kong

More information

XML WEB TECHNOLOGIES

XML WEB TECHNOLOGIES XML WEB TECHNOLOGIES Chakib Chraibi, Barry University, cchraibi@mail.barry.edu ABSTRACT The Extensible Markup Language (XML) provides a simple, extendable, well-structured, platform independent and easily

More information

Unified XML/relational storage March 2005. The IBM approach to unified XML/relational databases

Unified XML/relational storage March 2005. The IBM approach to unified XML/relational databases March 2005 The IBM approach to unified XML/relational databases Page 2 Contents 2 What is native XML storage? 3 What options are available today? 3 Shred 5 CLOB 5 BLOB (pseudo native) 6 True native 7 The

More information

Integrating XML and Databases

Integrating XML and Databases Databases Integrating XML and Databases Elisa Bertino University of Milano, Italy bertino@dsi.unimi.it Barbara Catania University of Genova, Italy catania@disi.unige.it XML is becoming a standard for data

More information

Reverse Engineering of Relational Databases to Ontologies: An Approach Based on an Analysis of HTML Forms

Reverse Engineering of Relational Databases to Ontologies: An Approach Based on an Analysis of HTML Forms Reverse Engineering of Relational Databases to Ontologies: An Approach Based on an Analysis of HTML Forms Irina Astrova 1, Bela Stantic 2 1 Tallinn University of Technology, Ehitajate tee 5, 19086 Tallinn,

More information

Building a Question Classifier for a TREC-Style Question Answering System

Building a Question Classifier for a TREC-Style Question Answering System Building a Question Classifier for a TREC-Style Question Answering System Richard May & Ari Steinberg Topic: Question Classification We define Question Classification (QC) here to be the task that, given

More information

A common interface for multi-rule-engine distributed systems

A common interface for multi-rule-engine distributed systems A common interface for multi-rule-engine distributed systems Pierre de Leusse, Bartosz Kwolek and Krzysztof Zieliński Distributed System Research Group, AGH University of Science and Technology Krakow,

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION CHAPTER 1 INTRODUCTION 1.1 Introduction Nowadays, with the rapid development of the Internet, distance education and e- learning programs are becoming more vital in educational world. E-learning alternatives

More information

Using LSI for Implementing Document Management Systems Turning unstructured data from a liability to an asset.

Using LSI for Implementing Document Management Systems Turning unstructured data from a liability to an asset. White Paper Using LSI for Implementing Document Management Systems Turning unstructured data from a liability to an asset. Using LSI for Implementing Document Management Systems By Mike Harrison, Director,

More information

Introduction. A. Bellaachia Page: 1

Introduction. A. Bellaachia Page: 1 Introduction 1. Objectives... 3 2. What is Data Mining?... 4 3. Knowledge Discovery Process... 5 4. KD Process Example... 7 5. Typical Data Mining Architecture... 8 6. Database vs. Data Mining... 9 7.

More information

PREDICTING STUDENTS PERFORMANCE USING ID3 AND C4.5 CLASSIFICATION ALGORITHMS

PREDICTING STUDENTS PERFORMANCE USING ID3 AND C4.5 CLASSIFICATION ALGORITHMS PREDICTING STUDENTS PERFORMANCE USING ID3 AND C4.5 CLASSIFICATION ALGORITHMS Kalpesh Adhatrao, Aditya Gaykar, Amiraj Dhawan, Rohit Jha and Vipul Honrao ABSTRACT Department of Computer Engineering, Fr.

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014 RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

More information

Combining SAWSDL, OWL DL and UDDI for Semantically Enhanced Web Service Discovery

Combining SAWSDL, OWL DL and UDDI for Semantically Enhanced Web Service Discovery Combining SAWSDL, OWL DL and UDDI for Semantically Enhanced Web Service Discovery Dimitrios Kourtesis, Iraklis Paraskakis SEERC South East European Research Centre, Greece Research centre of the University

More information

A Model-based Software Architecture for XML Data and Metadata Integration in Data Warehouse Systems

A Model-based Software Architecture for XML Data and Metadata Integration in Data Warehouse Systems Proceedings of the Postgraduate Annual Research Seminar 2005 68 A Model-based Software Architecture for XML and Metadata Integration in Warehouse Systems Abstract Wan Mohd Haffiz Mohd Nasir, Shamsul Sahibuddin

More information

A Strategic Framework for Enterprise Information Integration of ERP and E-Commerce

A Strategic Framework for Enterprise Information Integration of ERP and E-Commerce A Strategic Framework for Enterprise Information Integration of ERP and E-Commerce Zaojie Kong, Dan Wang and Jianjun Zhang School of Management, Hebei University of Technology, Tianjin 300130, P.R.China

More information

Lesson 4 Web Service Interface Definition (Part I)

Lesson 4 Web Service Interface Definition (Part I) Lesson 4 Web Service Interface Definition (Part I) Service Oriented Architectures Module 1 - Basic technologies Unit 3 WSDL Ernesto Damiani Università di Milano Interface Definition Languages (1) IDLs

More information

Database Marketing, Business Intelligence and Knowledge Discovery

Database Marketing, Business Intelligence and Knowledge Discovery Database Marketing, Business Intelligence and Knowledge Discovery Note: Using material from Tan / Steinbach / Kumar (2005) Introduction to Data Mining,, Addison Wesley; and Cios / Pedrycz / Swiniarski

More information

CST6445: Web Services Development with Java and XML Lesson 1 Introduction To Web Services 1995 2008 Skilltop Technology Limited. All rights reserved.

CST6445: Web Services Development with Java and XML Lesson 1 Introduction To Web Services 1995 2008 Skilltop Technology Limited. All rights reserved. CST6445: Web Services Development with Java and XML Lesson 1 Introduction To Web Services 1995 2008 Skilltop Technology Limited. All rights reserved. Opening Night Course Overview Perspective Business

More information

Intelligent Data Analysis: Issues and Challenges

Intelligent Data Analysis: Issues and Challenges Intelligent Data Analysis: Issues and Challenges Richi Nayak School of Information Systems Queensland University of Technology Brisbane, QLD 4001, Australia r.nayak@qut.edu.au ABSTRACT Today with the advances

More information

dm106 TEXT MINING FOR CUSTOMER RELATIONSHIP MANAGEMENT: AN APPROACH BASED ON LATENT SEMANTIC ANALYSIS AND FUZZY CLUSTERING

dm106 TEXT MINING FOR CUSTOMER RELATIONSHIP MANAGEMENT: AN APPROACH BASED ON LATENT SEMANTIC ANALYSIS AND FUZZY CLUSTERING dm106 TEXT MINING FOR CUSTOMER RELATIONSHIP MANAGEMENT: AN APPROACH BASED ON LATENT SEMANTIC ANALYSIS AND FUZZY CLUSTERING ABSTRACT In most CRM (Customer Relationship Management) systems, information on

More information

COURSE RECOMMENDER SYSTEM IN E-LEARNING

COURSE RECOMMENDER SYSTEM IN E-LEARNING International Journal of Computer Science and Communication Vol. 3, No. 1, January-June 2012, pp. 159-164 COURSE RECOMMENDER SYSTEM IN E-LEARNING Sunita B Aher 1, Lobo L.M.R.J. 2 1 M.E. (CSE)-II, Walchand

More information

XML- New meta language in e-business

XML- New meta language in e-business 1 XML- New meta language in e-business XML (extensible Markup Language) has established itself as a new meta language in e-business. No matter what, text, pictures, video- or audio files - with the flexibility

More information

EFFECTIVE STORAGE OF XBRL DOCUMENTS

EFFECTIVE STORAGE OF XBRL DOCUMENTS EFFECTIVE STORAGE OF XBRL DOCUMENTS An Oracle & UBmatrix Whitepaper June 2007 Page 1 Introduction Today s business world requires the ability to report, validate, and analyze business information efficiently,

More information

Change Management for XML, in XML

Change Management for XML, in XML This is a draft for a chapter in the 5 th edition of The XML Handbook, due for publication in late 2003. Authors: Martin Bryan, Robin La Fontaine Change Management for XML, in XML The benefits of change

More information

How To Write A Contract Versioning In Wsdl 2.2.2

How To Write A Contract Versioning In Wsdl 2.2.2 023_013613517X_20.qxd 8/26/08 6:21 PM Page 599 Chapter 20 Versioning Fundamentals 20.1 Basic Concepts and Terminology 20.2 Versioning and Compatibility 20.3 Version Identifiers 20.4 Versioning Strategies

More information

Web Database Integration

Web Database Integration Web Database Integration Wei Liu School of Information Renmin University of China Beijing, 100872, China gue2@ruc.edu.cn Xiaofeng Meng School of Information Renmin University of China Beijing, 100872,

More information

2. Distributed Handwriting Recognition. Abstract. 1. Introduction

2. Distributed Handwriting Recognition. Abstract. 1. Introduction XPEN: An XML Based Format for Distributed Online Handwriting Recognition A.P.Lenaghan, R.R.Malyan, School of Computing and Information Systems, Kingston University, UK {a.lenaghan,r.malyan}@kingston.ac.uk

More information

XML Schema Definition Language (XSDL)

XML Schema Definition Language (XSDL) Chapter 4 XML Schema Definition Language (XSDL) Peter Wood (BBK) XML Data Management 80 / 227 XML Schema XML Schema is a W3C Recommendation XML Schema Part 0: Primer XML Schema Part 1: Structures XML Schema

More information

Component visualization methods for large legacy software in C/C++

Component visualization methods for large legacy software in C/C++ Annales Mathematicae et Informaticae 44 (2015) pp. 23 33 http://ami.ektf.hu Component visualization methods for large legacy software in C/C++ Máté Cserép a, Dániel Krupp b a Eötvös Loránd University mcserep@caesar.elte.hu

More information

Web Mining. Margherita Berardi LACAM. Dipartimento di Informatica Università degli Studi di Bari berardi@di.uniba.it

Web Mining. Margherita Berardi LACAM. Dipartimento di Informatica Università degli Studi di Bari berardi@di.uniba.it Web Mining Margherita Berardi LACAM Dipartimento di Informatica Università degli Studi di Bari berardi@di.uniba.it Bari, 24 Aprile 2003 Overview Introduction Knowledge discovery from text (Web Content

More information

Data Integration Hub for a Hybrid Paper Search

Data Integration Hub for a Hybrid Paper Search Data Integration Hub for a Hybrid Paper Search Jungkee Kim 1,2, Geoffrey Fox 2, and Seong-Joon Yoo 3 1 Department of Computer Science, Florida State University, Tallahassee FL 32306, U.S.A., jungkkim@cs.fsu.edu,

More information

Web Services Technologies

Web Services Technologies Web Services Technologies XML and SOAP WSDL and UDDI Version 16 1 Web Services Technologies WSTech-2 A collection of XML technology standards that work together to provide Web Services capabilities We

More information

estatistik.core: COLLECTING RAW DATA FROM ERP SYSTEMS

estatistik.core: COLLECTING RAW DATA FROM ERP SYSTEMS WP. 2 ENGLISH ONLY UNITED NATIONS STATISTICAL COMMISSION and ECONOMIC COMMISSION FOR EUROPE CONFERENCE OF EUROPEAN STATISTICIANS Work Session on Statistical Data Editing (Bonn, Germany, 25-27 September

More information

Extensible Markup Language (XML): Essentials for Climatologists

Extensible Markup Language (XML): Essentials for Climatologists Extensible Markup Language (XML): Essentials for Climatologists Alexander V. Besprozvannykh CCl OPAG 1 Implementation/Coordination Team The purpose of this material is to give basic knowledge about XML

More information

Novel Data Extraction Language for Structured Log Analysis

Novel Data Extraction Language for Structured Log Analysis Novel Data Extraction Language for Structured Log Analysis P.W.D.C. Jayathilake 99X Technology, Sri Lanka. ABSTRACT This paper presents the implementation of a new log data extraction language. Theoretical

More information

PSG College of Technology, Coimbatore-641 004 Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS.

PSG College of Technology, Coimbatore-641 004 Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS. PSG College of Technology, Coimbatore-641 004 Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS Project Project Title Area of Abstract No Specialization 1. Software

More information

Using Object And Object-Oriented Technologies for XML-native Database Systems

Using Object And Object-Oriented Technologies for XML-native Database Systems Using Object And Object-Oriented Technologies for XML-native Database Systems David Toth and Michal Valenta David Toth and Michal Valenta Dept. of Computer Science and Engineering Dept. FEE, of Computer

More information

Business Object Document (BOD) Message Architecture for OAGIS Release 9.+

Business Object Document (BOD) Message Architecture for OAGIS Release 9.+ Business Object Document (BOD) Message Architecture for OAGIS Release 9.+ an OAGi White Paper Document #20110408V1.0 Open standards that open markets TM Open Applications Group, Incorporated OAGi A consortium

More information

Wee Keong Ng. Web Data Management. A Warehouse Approach. With 106 Illustrations. Springer

Wee Keong Ng. Web Data Management. A Warehouse Approach. With 106 Illustrations. Springer Sourav S. Bhowmick Wee Keong Ng Sanjay K. Madria Web Data Management A Warehouse Approach With 106 Illustrations Springer Preface vii 1 Introduction 1 1.1 Motivation 2 1.1.1 Problems with Web Data 2 1.1.2

More information

International Journal of World Research, Vol: I Issue XIII, December 2008, Print ISSN: 2347-937X DATA MINING TECHNIQUES AND STOCK MARKET

International Journal of World Research, Vol: I Issue XIII, December 2008, Print ISSN: 2347-937X DATA MINING TECHNIQUES AND STOCK MARKET DATA MINING TECHNIQUES AND STOCK MARKET Mr. Rahul Thakkar, Lecturer and HOD, Naran Lala College of Professional & Applied Sciences, Navsari ABSTRACT Without trading in a stock market we can t understand

More information

Foundations of Business Intelligence: Databases and Information Management

Foundations of Business Intelligence: Databases and Information Management Chapter 5 Foundations of Business Intelligence: Databases and Information Management 5.1 Copyright 2011 Pearson Education, Inc. Student Learning Objectives How does a relational database organize data,

More information

Model-Driven Data Warehousing

Model-Driven Data Warehousing Model-Driven Data Warehousing Integrate.2003, Burlingame, CA Wednesday, January 29, 16:30-18:00 John Poole Hyperion Solutions Corporation Why Model-Driven Data Warehousing? Problem statement: Data warehousing

More information

DataDirect XQuery Technical Overview

DataDirect XQuery Technical Overview DataDirect XQuery Technical Overview Table of Contents 1. Feature Overview... 2 2. Relational Database Support... 3 3. Performance and Scalability for Relational Data... 3 4. XML Input and Output... 4

More information

Fast and Easy Delivery of Data Mining Insights to Reporting Systems

Fast and Easy Delivery of Data Mining Insights to Reporting Systems Fast and Easy Delivery of Data Mining Insights to Reporting Systems Ruben Pulido, Christoph Sieb rpulido@de.ibm.com, christoph.sieb@de.ibm.com Abstract: During the last decade data mining and predictive

More information

Xml Mediator and Data Management

Xml Mediator and Data Management Adaptive Data Mediation over XML Data Hui Lin, Tore Risch, Timour Katchaounov Hui.Lin, Tore.Risch, Timour.Katchaounov@dis.uu.se Uppsala Database Laboratory, Uppsala University, Sweden To be published in

More information

Intelligent Agents and XML - A method for accessing webportals in both B2C and B2B E-Commerce

Intelligent Agents and XML - A method for accessing webportals in both B2C and B2B E-Commerce Intelligent Agents and XML - A method for accessing webportals in both B2C and B2B E-Commerce Mühlbacher, Jörg R., Reisinger, Susanne, Sonntag, Michael Institute for Information Processing and Microprocessor

More information

Standardized Multimedia Retrieval in Distributed Heterogenous Database Systems. Dr. Mario Döller

Standardized Multimedia Retrieval in Distributed Heterogenous Database Systems. Dr. Mario Döller Standardized Multimedia Retrieval in Distributed Heterogenous Database Systems Dr. Mario Döller Motivation Current Situation Query Languages MMRS Metadata Annotation Professional Content Provider SQL/MM

More information

ANALYSIS OF WEBSITE USAGE WITH USER DETAILS USING DATA MINING PATTERN RECOGNITION

ANALYSIS OF WEBSITE USAGE WITH USER DETAILS USING DATA MINING PATTERN RECOGNITION ANALYSIS OF WEBSITE USAGE WITH USER DETAILS USING DATA MINING PATTERN RECOGNITION K.Vinodkumar 1, Kathiresan.V 2, Divya.K 3 1 MPhil scholar, RVS College of Arts and Science, Coimbatore, India. 2 HOD, Dr.SNS

More information

XML. CIS-3152, Spring 2013 Peter C. Chapin

XML. CIS-3152, Spring 2013 Peter C. Chapin XML CIS-3152, Spring 2013 Peter C. Chapin Markup Languages Plain text documents with special commands PRO Plays well with version control and other program development tools. Easy to manipulate with scripts

More information

Final Project Report

Final Project Report CPSC545 by Introduction to Data Mining Prof. Martin Schultz & Prof. Mark Gerstein Student Name: Yu Kor Hugo Lam Student ID : 904907866 Due Date : May 7, 2007 Introduction Final Project Report Pseudogenes

More information

Organizational Search in Email Systems

Organizational Search in Email Systems Western Kentucky University TopSCHOLAR Masters Theses & Specialist Projects Graduate School 5-1-2012 Organizational Search in Email Systems Sruthi Bhushan Pitla Western Kentucky University, sruthibhushan.pitla698@topper.wku.edu

More information

A COGNITIVE APPROACH IN PATTERN ANALYSIS TOOLS AND TECHNIQUES USING WEB USAGE MINING

A COGNITIVE APPROACH IN PATTERN ANALYSIS TOOLS AND TECHNIQUES USING WEB USAGE MINING A COGNITIVE APPROACH IN PATTERN ANALYSIS TOOLS AND TECHNIQUES USING WEB USAGE MINING M.Gnanavel 1 & Dr.E.R.Naganathan 2 1. Research Scholar, SCSVMV University, Kanchipuram,Tamil Nadu,India. 2. Professor

More information

Database Systems. Lecture 1: Introduction

Database Systems. Lecture 1: Introduction Database Systems Lecture 1: Introduction General Information Professor: Leonid Libkin Contact: libkin@ed.ac.uk Lectures: Tuesday, 11:10am 1 pm, AT LT4 Website: http://homepages.inf.ed.ac.uk/libkin/teach/dbs09/index.html

More information

Encoding Library of Congress Subject Headings in SKOS: Authority Control for the Semantic Web

Encoding Library of Congress Subject Headings in SKOS: Authority Control for the Semantic Web Encoding Library of Congress Subject Headings in SKOS: Authority Control for the Semantic Web Corey A Harper University of Oregon Libraries Tel: +1 541 346 1854 Fax:+1 541 346 3485 charper@uoregon.edu

More information

A New Marketing Channel Management Strategy Based on Frequent Subtree Mining

A New Marketing Channel Management Strategy Based on Frequent Subtree Mining A New Marketing Channel Management Strategy Based on Frequent Subtree Mining Daoping Wang Peng Gao School of Economics and Management University of Science and Technology Beijing ABSTRACT For most manufacturers,

More information

Semantic Search in Portals using Ontologies

Semantic Search in Portals using Ontologies Semantic Search in Portals using Ontologies Wallace Anacleto Pinheiro Ana Maria de C. Moura Military Institute of Engineering - IME/RJ Department of Computer Engineering - Rio de Janeiro - Brazil [awallace,anamoura]@de9.ime.eb.br

More information

How To Use Data Mining For Knowledge Management In Technology Enhanced Learning

How To Use Data Mining For Knowledge Management In Technology Enhanced Learning Proceedings of the 6th WSEAS International Conference on Applications of Electrical Engineering, Istanbul, Turkey, May 27-29, 2007 115 Data Mining for Knowledge Management in Technology Enhanced Learning

More information

Mining a Change-Based Software Repository

Mining a Change-Based Software Repository Mining a Change-Based Software Repository Romain Robbes Faculty of Informatics University of Lugano, Switzerland 1 Introduction The nature of information found in software repositories determines what

More information

Introduction to Service Oriented Architectures (SOA)

Introduction to Service Oriented Architectures (SOA) Introduction to Service Oriented Architectures (SOA) Responsible Institutions: ETHZ (Concept) ETHZ (Overall) ETHZ (Revision) http://www.eu-orchestra.org - Version from: 26.10.2007 1 Content 1. Introduction

More information

A LANGUAGE INDEPENDENT WEB DATA EXTRACTION USING VISION BASED PAGE SEGMENTATION ALGORITHM

A LANGUAGE INDEPENDENT WEB DATA EXTRACTION USING VISION BASED PAGE SEGMENTATION ALGORITHM A LANGUAGE INDEPENDENT WEB DATA EXTRACTION USING VISION BASED PAGE SEGMENTATION ALGORITHM 1 P YesuRaju, 2 P KiranSree 1 PG Student, 2 Professorr, Department of Computer Science, B.V.C.E.College, Odalarevu,

More information

Data Mining Analytics for Business Intelligence and Decision Support

Data Mining Analytics for Business Intelligence and Decision Support Data Mining Analytics for Business Intelligence and Decision Support Chid Apte, T.J. Watson Research Center, IBM Research Division Knowledge Discovery and Data Mining (KDD) techniques are used for analyzing

More information
