PMML and UIMA Based Frameworks for Deploying Analytic Applications and Services
PMML and UIMA Based Frameworks for Deploying Analytic Applications and Services

David Ferrucci 1, Robert L. Grossman 2 and Anthony Levas 1

1 IBM T.J. Watson Research Center
2 Open Data Group and University of Illinois at Chicago

1. Introduction - The Challenges of Deploying Analytic Applications and Services

It is convenient to divide data into structured data, semi-structured data and unstructured data. By structured data, we mean data that is organized into fields or attributes; database records are an example. Semi-structured data has attributes but lacks the regularity of structured data; data defined by HTML or XML tags is an example. Unstructured data lacks attributes or fields and includes text, signals, images, video, audio and similar data. Of course, data may be a combination of one or more of these types. For example, the content of a message can be unstructured text while its metadata consists of semi-structured XML tags.

By an analytic application, we mean an application that analyzes data. Examples include statistical or data mining models, extracting features from images or signals, parsing text, and analyzing text using natural language processing or language models. Today, the process of developing analytic applications presents several challenges.

1. Deployment. Analytic models for structured data are commonly built using a statistical or data mining application, but deployed in operational systems where the models must often be recoded in C++, Java or another language. This slows down the deployment of new analytic applications and the updating of current ones.

2. No standard architecture. There is no generally accepted standard architecture for analytic applications, which increases the cost and time required to develop them. In particular, a standard architecture for developing and deploying analytic applications would facilitate code reuse.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DM-SSP'06, August 20, 2006, Philadelphia, Pennsylvania, USA. Copyright 2006 ACM X...$
3. The difficulty of cleaning, preparing, integrating and transforming data. The cost of developing analytic applications is usually dominated by the cost of collecting, cleaning, transforming and preparing the data prior to its modeling or scoring. If, as is often the case, the data comes from two or more separate sources or systems, then the data must also be integrated, complicating the process even further.

In this white paper, we describe an approach to developing and deploying analytic applications that addresses each of these challenges. During the past several years, the Data Mining Group's Predictive Model Markup Language (PMML) has emerged as an interchange format for analytic models for structured data, while IBM's Unstructured Information Management Architecture (UIMA) has emerged as a component-based architecture for developing analytic applications for unstructured data. In this paper, we give quick introductions to PMML and UIMA, describe some of the structure they share, and describe two different ways that applications combining both PMML and UIMA can be built. To illustrate the core ideas, we describe a case study that shows how an analytic application can be built that automatically transforms messages at one security level into messages at a lower security level.

2. Example Scenarios

In this section, we describe two simple scenarios using PMML based and UIMA based models.

Categorization. Given a text message, there is a standard trick for creating a state vector from it. One simply counts the number of times each word in a dictionary occurs, forms a state vector of the counts, and normalizes the state vector appropriately. This is often called a bag-of-words model. Given the state vector, standard statistical models for clustering or probabilistic clustering can then be used to categorize the message. There are PMML based models for bag-of-words models and for several different clustering models that can be used in this way to encapsulate the data required for categorizing messages.

Processing XML tagged data to lower the classification level. Consider an XML tagged message that is classified at level C1 and contains specific information about a specific individual in a specific location. By coarsening the location data and replacing specific information about individuals with more generic information, such as "report of hostile activity within the last three hours in named Area of Interest A", the message, in many cases, can be assigned a lower classification level, say C2, and can be more broadly distributed. UIMA Annotators can be used to perform the various translations required in this example.

The examples in this section are described in more detail in Section 8.
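The bag-of-words categorization scenario above can be sketched in a few lines. The following is a minimal, self-contained illustration, not a PMML implementation; the dictionary, message and cluster centers are hypothetical, and the "clustering model" is simply a nearest-center assignment.

```python
from collections import Counter
import math

def bag_of_words(message, dictionary):
    """Count how often each dictionary word occurs in the message and
    normalize the resulting state vector to unit length."""
    counts = Counter(message.lower().split())
    vector = [counts[word] for word in dictionary]
    norm = math.sqrt(sum(c * c for c in vector)) or 1.0
    return [c / norm for c in vector]

def categorize(vector, centers):
    """Assign the state vector to the nearest cluster center."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centers, key=lambda label: dist2(vector, centers[label]))

# Hypothetical dictionary, message and cluster centers.
dictionary = ["report", "hostile", "activity", "area"]
vec = bag_of_words("Report of hostile activity in named area A", dictionary)
label = categorize(vec, {"threat": [0.5, 0.5, 0.5, 0.5],
                         "benign": [0.0, 0.0, 0.0, 0.0]})
```

In a PMML setting, the dictionary and the cluster centers would be carried in the PMML file itself, so a consumer could reproduce this scoring from the model document alone.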
3. PMML-Based Analytic Services - Key Concepts

PMML is declarative and data-centric in the sense that a markup language is used to describe analytic models in a platform and application independent fashion. In contrast, UIMA is process-centric. In PMML, you create models; in UIMA, you create engines. PMML's approach to developing and deploying analytic applications is based upon a few key concepts:

View analytic models as first class objects. We think of analytic models as being described using the Predictive Model Markup Language (PMML). Applications or services can be thought of as producing PMML or consuming PMML. A PMML XML file contains enough information that an application can process and score a data stream with a statistical or data mining model using only the information in the PMML file.

Support data preparation. As mentioned above, data preparation is often the most time consuming part of the data mining process. PMML provides explicit support for many common data transformations and aggregations used when preparing data. Once encapsulated in this way, data preparation can more easily be re-used and leveraged by different components and applications.

Facilitate using different architectures for learning models and scoring models. Broadly speaking, most analytic applications consist of a learning phase that creates a (PMML) model and a scoring phase that employs the (PMML) model to score a data stream or batch of records. The learning phase usually consists of the following sub-stages: exploratory data analysis, data preparation, event shaping, data modeling, and model validation. The scoring phase is typically simpler: either a stream or a batch of data is scored using a model. PMML is designed so that different systems and applications can be used for producing models (PMML Producers) and for consuming models (PMML Consumers).

View data as event based. Many analytic applications can naturally be thought of as event based. Event based data presents itself as a stream of events that must be transformed, integrated, or aggregated to produce the state vectors that are inputs to statistical or data mining models. The current version of PMML provides implicit support for event based processing of data; future versions are expected to provide explicit support.
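To make the event-based view concrete, the sketch below aggregates a stream of events into per-window state vectors. The event shape (timestamp, field, value) and the windowed-sum aggregation are illustrative assumptions, not part of the PMML specification.

```python
from collections import defaultdict

def aggregate_events(events, window_seconds):
    """Aggregate a stream of (timestamp, field, value) events into
    per-window state vectors by summing each field within a window."""
    windows = defaultdict(lambda: defaultdict(float))
    for timestamp, field, value in events:
        windows[timestamp // window_seconds][field] += value
    return {w: dict(fields) for w, fields in windows.items()}

# Hypothetical event stream: (seconds since start, field name, value).
events = [(3, "logins", 1), (42, "bytes", 512), (75, "logins", 1)]
state = aggregate_events(events, window_seconds=60)
```

Each window's field sums then serve as one input record (state vector) for a statistical or data mining model.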
The following summarizes how learning and scoring map onto PMML and UIMA:

Learning. In PMML, a PMML producer processes a data set to produce a PMML model. In UIMA, a Collection Processing Engine (CPE) can be used in learning or training to analyze a collection or group of documents to produce an Analysis Engine (AE). CPEs can also be used much more generally for a variety of other tasks.

Scoring. In PMML, a PMML consumer (i) inputs a PMML model and then (ii) uses the model to process one or more data records to produce scores. In UIMA, an Analysis Engine (AE) analyzes documents based on the model produced by the CPE in the training phase. This is also called an Applier.

4. Overview of PMML

In this section, we give a brief introduction to PMML [Grossman:2002]. PMML is an application and system independent interchange format for statistical and data mining models. More precisely, the goal of PMML is to encapsulate a model in an application and system independent fashion so that two different applications (the PMML Producer and Consumer) can use it. PMML is developed by a vendor led working group, which is part of the Data Mining Group [DMG]. PMML consists of the following components:

1. Data Dictionary. The data dictionary defines the fields that are the inputs to models and specifies the type and value range for each field.

2. Mining Schema. Each model contains one mining schema that lists the fields used in the model. These fields are a subset of the fields in the Data Dictionary. The mining schema contains information that is specific to a certain model, while the data dictionary contains data definitions that do not vary with the model. For example, the Mining Schema specifies the usage type of an attribute, which may be active (an input of the model), predicted (an output of the model), or supplementary (holding descriptive information and ignored by the model).

3. Transformation Dictionary. The Transformation Dictionary defines derived fields. Derived fields may be defined by normalization, which maps continuous or discrete values to numbers; by discretization, which maps continuous values to discrete values; by value mapping, which maps discrete values to discrete values; or by aggregation, which summarizes or collects groups of values, for example by computing averages.

4. Model Statistics. The Model Statistics component contains basic univariate statistics about the model, such as the minimum, maximum, mean, standard deviation and median of numerical attributes.
5. Model Parameters. PMML also specifies the actual parameters defining the statistical and data mining models per se. Models in PMML Version 3.0 include regression models, clustering models, trees, neural networks, Bayesian models, association rules, sequence models, support vector machines, rule sets, and text models.

Figure 1 illustrates the relationship of the Data Dictionary, Mining Schema and Transformation Dictionary. Note that inputs to models can be defined directly from the Mining Schema or indirectly as derived attributes using the Transformation Dictionary.

PMML also provides some additional infrastructure that is helpful when modeling, including:

- A variety of data types, including arrays of values, sparse arrays of values, and matrices.
- Support for missing values, and several different ways of working with them.

[Figure 1 is a diagram relating the Data Dictionary (fields 1-5), the Mining Schema (a subset of those fields), the Transformation Dictionary (derived fields), and the Model (attributes, derived attributes, model parameters, model statistics, model extensions).]

Figure 1: This figure shows some of the basic components in a PMML model. Note that inputs to a model are of two types: basic attributes, which are directly specified by the Mining Schema, and derived attributes, which are defined in terms of transformations from the Transformation Dictionary applied to attributes in the mining schema.
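To make these components concrete, the skeleton below sketches what the Data Dictionary, Transformation Dictionary and Mining Schema of a PMML document might look like. The field names and values are hypothetical, and the fragment is abbreviated for illustration rather than a complete, schema-valid PMML 3.0 document.

```xml
<PMML version="3.0">
  <!-- Data Dictionary: model-independent field definitions. -->
  <DataDictionary numberOfFields="3">
    <DataField name="income" optype="continuous" dataType="double"/>
    <DataField name="age" optype="continuous" dataType="integer"/>
    <DataField name="risk" optype="categorical" dataType="string"/>
  </DataDictionary>
  <!-- Transformation Dictionary: derived fields. -->
  <TransformationDictionary>
    <!-- Normalization: map "income" linearly onto [0, 1]. -->
    <DerivedField name="income_norm" optype="continuous" dataType="double">
      <NormContinuous field="income">
        <LinearNorm orig="0" norm="0"/>
        <LinearNorm orig="100000" norm="1"/>
      </NormContinuous>
    </DerivedField>
    <!-- Discretization: bin "age" into discrete labels. -->
    <DerivedField name="age_band" optype="categorical" dataType="string">
      <Discretize field="age">
        <DiscretizeBin binValue="young">
          <Interval closure="closedOpen" leftMargin="0" rightMargin="30"/>
        </DiscretizeBin>
        <DiscretizeBin binValue="older">
          <Interval closure="closedOpen" leftMargin="30" rightMargin="120"/>
        </DiscretizeBin>
      </Discretize>
    </DerivedField>
  </TransformationDictionary>
  <!-- Mining Schema, nested inside a model: per-model field usage. -->
  <TreeModel functionName="classification">
    <MiningSchema>
      <MiningField name="income" usageType="active"/>
      <MiningField name="age" usageType="active"/>
      <MiningField name="risk" usageType="predicted"/>
    </MiningSchema>
    <!-- model parameters (the tree itself) would follow here -->
  </TreeModel>
</PMML>
```

Note how the model's inputs can be basic fields from the Mining Schema ("income", "age") or derived fields from the Transformation Dictionary ("income_norm", "age_band"), mirroring Figure 1.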
5. Overview of UIMA - Key Concepts

In this section, we describe some of the key concepts associated with the Unstructured Information Management Architecture (UIMA) and framework. (IBM freely provides a framework implementation supporting the creation, composition and deployment of UIMA-compliant analytic applications. This implementation is included in the UIMA software development kit (SDK) and is available on the IBM alphaWorks site. The UIMA Java framework is also available as open source.) For simplicity and concreteness, we describe these concepts in the context of documents, although UIMA is broadly applicable to other types of semi-structured or unstructured data. This section is adapted in part from [Ferrucci:2004].

Analysis Engines. The fundamental processing component for document analysis in UIMA is the Analysis Engine (AE). AEs have a standardized interface and may be declaratively composed to build aggregate analysis capabilities. AEs built by composition have a recursive structure: primitive AEs may be core analysis components implemented in C++ or Java, whereas aggregate AEs are composed of primitive AEs or other aggregate AEs. Because primitive and aggregate AEs have exactly the same interfaces, it is possible to recursively assemble advanced analysis components from more basic elements while the implementation details remain transparent to the composition task. AEs analyze artifacts (documents, images, conversations, videos, etc.) represented in a common data structure called the Common Analysis Structure (CAS). The logical interface to an Analysis Engine is essentially CAS in / CAS out, and an AE's behavior is assumed to depend only on its input CAS. AEs are implemented to perform a wide variety of artifact analysis tasks, including, for example, document tokenization, parsing, named-entity detection, relationship detection, speaker identification and translation. Generally, AEs produce metadata describing the artifact in the form of annotations. Annotations are used to label regions of the original artifact. Different types of Analysis Engines include, for example, Annotators and CAS Consumers. Annotators typically produce new annotations, adding metadata to a CAS. CAS Consumers extract metadata from a CAS and populate structured resources. These are described in more detail below.

Common Analysis Structure (CAS). Analysis Engines operate on and share data through a standardized representation called the CAS. The CAS represents the artifact being analyzed (document, image, conversation, video, etc.) and the set of assertions produced by some sequence of Analysis Engines. The set of assertions in the CAS is intended to describe the artifact or features thereof and may be collectively referred to as artifact metadata, or just metadata.
The metadata may take the form of annotations. UIMA annotations are similar to the notion of annotations in other systems such as Catalyst [Mardis] and TIPSTER [Grishman]. They are implemented as separate objects that point into the original artifact rather than as in-line markup. This allows multiple sets of interpretations or labelings of an artifact to exist simultaneously without corrupting the original representation. Analysis Engines may use different interfaces to read and update the CAS, and the UIMA framework may contain different internal implementations of the CAS. The UIMA framework provides several interfaces to the CAS, including the JCas, which maintains fast indices to CAS contents and projects a native Java object programming model onto the data elements in the CAS.

Type Systems. A CAS conforms to a user-defined Type System. This is essentially a schema or data model that defines the types of objects and properties that may be asserted into a CAS by an analytic component.

Analysis Engine Flows. Aggregate AEs specify a workflow of internal component engines. UIMA includes a control component interface called the Flow Controller. The UIMA framework provides an implementation of this interface that allows the specification of simple static linear flows among component engines. The developer, however, is free to create complex and dynamically determined flows by providing a different implementation of the Flow Controller interface. During the execution of an Aggregate Engine, the Flow Controller considers incoming CASs (and possibly other information about the aggregate's component engines or other external information) and determines the next engine in the flow that should get access to each CAS. The framework then performs the necessary routing. Each AE in the flow considers the CAS content and updates the metadata with new assertions. The updated CAS is the result of the analysis.

Collection Readers. Another UIMA component interface is the Collection Reader. Collection Reader implementations connect to information sources (databases, file systems, audio or video streams, etc.) and segment these sources into artifacts. Each artifact is used to initialize a CAS for down-stream analysis by workflows of Analysis Engines. The logical interface for a Collection Reader is "get next CAS". Collection Readers may be implemented to support a wide variety of source formats. They are required to report when the end of the collection is detected.

CAS Consumers. After an artifact has been analyzed by some flow of Analysis Engines, the analysis results are contained in a CAS. The CAS is typically used to populate structured resources that provide the application with targeted and efficient access to the results of interest. CAS Consumers ingest CASs and interface with structured resources including, for example, file systems, search engine indexers, databases and knowledge bases. They populate these resources with application-relevant metadata extracted from the CAS. The simplest case may be a CAS Consumer that converts the CAS content into XML format and writes it to a directory on the file system. A more complex example may be a CAS Consumer that extracts a subset of the metadata, organizes it as a collection of facts over entities and populates a formal knowledge base. In either case, the structured resources populated by CAS Consumers are in turn used to support high-level decision support applications. CAS Consumers are so named because they are not intended to update input CASs.
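The division of labor among the CAS, Collection Readers and CAS Consumers can be sketched as follows. This is a conceptual illustration only; the class and function names are hypothetical and are not the UIMA SDK's actual API.

```python
class CAS:
    """Minimal stand-in for the Common Analysis Structure: the artifact
    plus annotations that point into it rather than in-line markup."""
    def __init__(self, text):
        self.text = text
        self.annotations = []  # (type, begin, end) assertions

def collection_reader(source):
    """Logical interface of a Collection Reader: 'get next CAS' until
    the end of the collection is reached."""
    for document in source:
        yield CAS(document)

def cas_consumer(cas):
    """Extract metadata from a CAS into a structured record; a CAS
    Consumer reads the CAS but does not update it."""
    return {"text": cas.text,
            "spans": [(t, cas.text[b:e]) for t, b, e in cas.annotations]}

cas = next(collection_reader(["Report of hostile activity"]))
cas.annotations.append(("Activity", 10, 26))  # an Analysis Engine's assertion
record = cas_consumer(cas)
```

Because the annotations are stand-off offsets into the text, several independent labelings can coexist on the same artifact without altering it.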
Collection Processing Engines. In UIMA, a special type of Aggregate Engine called a Collection Processing Engine may be assembled to perform collection-level analysis, where entire collections of documents are analyzed as a whole to produce collection-level results. Collection Processing Engines are made up of a Collection Reader, a set of Analysis Engines and a set of CAS Consumers. Examples of collection-level analysis performed by Collection Processing Engines include the creation of glossaries, dictionaries, databases, search indexes, and ontologies from the results of analyzing entire collections of artifacts. UIMA does not restrict the content, form, or technical realization of collection-level results; they are treated generally as structured information resources.

Component Descriptors. All UIMA component implementations are associated with XML documents called Component Descriptors. These contain metadata describing the component, including its identification, version, constituents (if an aggregate), configuration parameters and capabilities. This information is used to aid in tooling, component management, discovery, packaging and installation.

A UIMA Framework

A UIMA framework [Ferrucci:2004] supports the component or application developer in creating, composing and deploying UIMA-compliant components and/or service interfaces. The UIMA SDK is a UIMA framework implemented in Java. In addition to Java-based analysis components, it supports the integration of C++ analytics and allows for the interoperation of component analytics written in other languages via service-oriented interfaces. UIMA framework functions are designed to simplify the development of analytic applications. They provide a variety of facilities to the component and application developer. The following is a partial list of UIMA framework facilities:

Type System Validation. UIMA defines an XML format for specifying type systems. The framework includes utilities for assisting in the definition and validation of type system definitions.

Component Descriptor Validation. The UIMA framework includes methods for validating component descriptor XML files.

Engine Creation. Analysis Engines are the pluggable components managed, composed and deployed by the framework. Developers, however, implement a simple interface called the Annotator interface. This interface is logically CAS in / CAS out. Annotator implementations encapsulate the analysis algorithm code. Annotator implementations, joined with their component descriptors, are passed to factory methods provided by the framework. The framework wraps the annotator with infrastructure code and returns a pluggable analysis engine component that projects the standard analysis engine interface.

Engine Composition. Analysis Engines may be composed to produce aggregate analysis capabilities. The UIMA framework supports declarative specifications for describing aggregate engines in terms of component engines and a workflow specification. This declarative specification, called an aggregate descriptor, is passed to a framework factory method that encapsulates the aggregate's internal
components and returns a pluggable Analysis Engine that projects the standard analysis engine interface. The UIMA framework will automatically create a type system for an aggregate by merging the type systems of its named components.

Engine Deployment. Analysis Engines may be deployed as Integrated or Remote Engines. Integrated Engines are co-located with the framework run-time and other cooperating components making up an aggregate; CASs are shared as part of the same memory space with other integrated components. Remote Engines are deployed as network services. To communicate with Remote Engines, the framework handles CAS serialization and transport using a variety of pluggable transport protocols, including SOAP. The engine developer need not write any special code to deploy an engine as either integrated or remote, but rather indicates the desired option in the component's descriptor.

CAS Management and Transport. The UIMA framework manages CAS pools and handles CAS routing and transport among components cooperating as part of an Aggregate Analysis Engine or Collection Processing Engine. The framework is responsible for ensuring that, independent of engine deployment options, the CAS is delivered to component engines in the order determined by the appropriate Analysis Sequencer.

Collection Processing Management. The UIMA framework supports deployment options for Collection Processing Engines that exploit multi-threading and multi-processing environments to improve robustness and throughput for large collection processing application environments.

7. Building Applications Using Both PMML and UIMA

We can now summarize the discussion thus far:

By models, we mean statistical or data mining models. Models are defined by PMML files. As an example, an Analysis Engine for a UIMA-based analytic application can employ a PMML model to specify how the Analysis Engine processes documents or other data. As a second example, a Collection Processing Engine for a UIMA-based analytic application can produce a PMML model as the result of analyzing a collection of documents.

We assume that components for analytic processing depend only upon their inputs and are described by standardized XML interfaces. Both PMML and UIMA support this view of components.

A framework provides an implementation of one or more analytic components in a particular language, say C++, Java or Python. The Data Mining Group is a consortium of vendors who develop and market one or more PMML components. The UIMA Java framework of the UIMA SDK is open source.

By an analytics platform, we mean a standards-based collection of interoperable components for the analytic processing of data. Both PMML and UIMA support the key requirements of an analytics platform by employing models and components.
An analytics platform requires a standardized container for data and metadata. In UIMA, for example, this is provided by the CAS. PMML does not currently have an analogous concept, except as implicitly defined by data and mining schemas. Finally, an analytics platform requires a workflow language so that different components can be combined. UIMA explicitly supports the integration of different components using workflow concepts; currently PMML does not, but future versions of PMML are expected to do so.

Integration Approaches

The simplest way to develop UIMA/PMML based applications is simply to develop an application using UIMA and PMML based services.

Another approach is to wrap PMML components as UIMA Analysis Engines and have these components access and use PMML models as UIMA Shared Resources. This approach provides a relatively easy path to leveraging UIMA's facilities, including those for component aggregation, deployment, reuse, data sharing and workflow management, while enabling PMML capabilities to plug in by changing how they access and use PMML models.

A more sophisticated approach, aimed at achieving a tighter integration between UIMA and PMML, would represent PMML models directly in the UIMA CAS and enhance a CAS interface to provide specialized functions for manipulating PMML models. This approach would have to continue to support direct XML-based access to PMML models to maintain backward compatibility with PMML components.

In this paper, we describe a case study that takes the straightforward approach outlined above.

8. Case Study

This section is adapted from the technical report [OSC:2005]. In this section, we illustrate a Message Transformation Application using a UIMA-based and PMML-based architecture. In this application, incoming messages at a security level denoted SLA stream through a UIMA Collection Processing Engine (CPE) that transforms the text to produce outgoing messages at a lower security level, denoted SLB. The following high-level sequence of transformation operations is applied to each message in series:

1. Metadata within the message are converted to annotations.
2. Certain words are removed due to their sensitivity.
3. A PMML model is used to categorize the message.
4. On the basis of this categorization, mappings are applied in turn to determine the who, what, where and when of the new message.
5. Additional filtering is done, if required, on the words and terms in the transformed message.
[Figure: the Collection Processing Engine. Messages at SLA flow through a Collection Reader into an Aggregate Analysis Engine (Annotator; Censor, driven by regular expressions; PMML Classifier, driven by the constrained vocabulary; Composer) and out through a CAS Consumer, which applies a vocabulary-subset filter and writes to a database, producing messages at SLB.]

The behaviors and capabilities of the UIMA components comprising the CPE are as follows:

Collection Reader. The Collection Reader connects to the message source and iterates through the messages, initializing a UIMA CAS for each. The CAS stores the message text to be processed as well as message-level metadata.

Annotator. The first UIMA Analysis Engine is an Annotator that parses the message, converting its structure and metadata to annotation objects that are indexed and stored in the CAS for processing by down-stream AEs. In particular, paragraphs with explicit security designations are annotated as either blocked or pass-through, depending on a security-level threshold parameter.

Censor. The role of the Censor AE is to eliminate specific terms from the message text. These terms are specified by regular expressions listed in an external resource file. The cleaned text is stored in a bag-of-words object in the CAS.

PMML Classifier. The Classifier AE scores the terms in the constrained vocabulary according to their relevance to the bag of words associated with the message. The resulting scores are indexed in the CAS.

Composer. The Composer AE produces a transformed message by composing a string from the highest scored terms in the constrained vocabulary. In addition, paragraphs annotated as pass-through are appended to the string without modification. This string is stored in the CAS as a composed-message object.
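The Censor, Classifier and Composer stages described above can be sketched as plain functions. The regular expression, constrained vocabulary and scoring rule below are all hypothetical; in particular, the word-overlap score stands in for the actual PMML model.

```python
import re
from collections import Counter

def censor(text, patterns):
    """Censor stage: remove terms matching any sensitivity pattern,
    then reduce the cleaned text to a bag of words."""
    for pattern in patterns:
        text = re.sub(pattern, "", text)
    return Counter(text.lower().split())

def classify(bag, vocabulary):
    """Classifier stage: score each constrained-vocabulary term by its
    word overlap with the message (a stand-in for the PMML model)."""
    return {term: sum(bag[w] for w in term.split()) for term in vocabulary}

def compose(scores, top_n=1):
    """Composer stage: build the outgoing message from the top-scored
    constrained-vocabulary terms."""
    ranked = sorted((t for t in scores if scores[t] > 0),
                    key=lambda t: scores[t], reverse=True)
    return "; ".join(ranked[:top_n])

patterns = [r"\bSgt\.\s+\w+\b"]                      # hypothetical censor rule
vocabulary = ["hostile activity", "routine patrol"]  # constrained vocabulary
bag = censor("Sgt. Smith reports hostile activity", patterns)
outgoing = compose(classify(bag, vocabulary))
```

The outgoing message is built entirely from the constrained vocabulary, so nothing censored from the original text can leak into the lower-security output.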
CAS Consumer(s). The role of the CAS Consumer is to output the transformed message. In addition, an outgoing message filter may be applied to output only those transformed messages that include a chosen subset of the constrained vocabulary terms. A variety of CAS Consumers can be used simultaneously to effect a range of message storage or transmission options.

Regular expressions representing censor words, terms constituting the constrained vocabulary, and scoring engine model parameters are stored in external resource files. In addition, all other UIMA component parameters are specified in associated Resource Descriptor Files. Thus, all the terms, patterns, and parameters may be modified or extended without requiring re-compilation of the code.

9. Summary and Conclusions

In this paper, we have introduced the key concepts and ideas behind the Predictive Model Markup Language (PMML) and the Unstructured Information Management Architecture (UIMA) and have shown how they may be combined to create an open, standards based platform for the analytic processing of structured, semi-structured, and unstructured data.
References

[DMG] The PMML Working Group is part of the Data Mining Group. See

[Ferrucci:2004] D. Ferrucci and A. Lally, Building an example application with the Unstructured Information Management Architecture, IBM Systems Journal, Volume 43, 2004, pages

[Grishman] R. Grishman, Tipster Architecture Design Document Version 2.2, Technical Report, Defense Advanced Research Projects Agency (DARPA), U.S. Department of Defense (1996).

[Grossman:2002] Robert Grossman, Mark Hornick, and Gregor Meyer, Data Mining Standards Initiatives, Communications of the ACM, Volume 45, Number 8, 2002, pages

[Mardis] S. Mardis and J. Burger, Qanda and the Catalyst Architecture, AAAI Spring Symposium on Mining Answers from Text and Knowledge Bases, American Association for Artificial Intelligence (2002).

[OSC:2005] Object Sciences Technical Report, A UIMA Example for the High to Low Processing of Messages,

[PMML] PMML documentation can be found on the web site: sourceforge.net/projects/pmml/
CUSTOMER Presentation of SAP Predictive Analytics
SAP Predictive Analytics 2.0 2015-02-09 CUSTOMER Presentation of SAP Predictive Analytics Content 1 SAP Predictive Analytics Overview....3 2 Deployment Configurations....4 3 SAP Predictive Analytics Desktop
XBRL Processor Interstage XWand and Its Application Programs
XBRL Processor Interstage XWand and Its Application Programs V Toshimitsu Suzuki (Manuscript received December 1, 2003) Interstage XWand is a middleware for Extensible Business Reporting Language (XBRL)
LINKING DOCUMENTS IN REPOSITORIES TO STRUCTURED DATA IN DATABASE
LINKING DOCUMENTS IN REPOSITORIES TO STRUCTURED DATA IN DATABASE A thesis submitted to the University of Manchester for the degree of Master of Science in the Faculty of Engineering and Physical Sciences
Service-Oriented Architectures
Architectures Computing & 2009-11-06 Architectures Computing & SERVICE-ORIENTED COMPUTING (SOC) A new computing paradigm revolving around the concept of software as a service Assumes that entire systems
1 What Are Web Services?
Oracle Fusion Middleware Introducing Web Services 11g Release 1 (11.1.1.6) E14294-06 November 2011 This document provides an overview of Web services in Oracle Fusion Middleware 11g. Sections include:
An Oracle White Paper October 2013. Oracle Data Integrator 12c New Features Overview
An Oracle White Paper October 2013 Oracle Data Integrator 12c Disclaimer This document is for informational purposes. It is not a commitment to deliver any material, code, or functionality, and should
UIMA Overview & SDK Setup
UIMA Overview & SDK Setup Written and maintained by the Apache UIMA Development Community Version 2.4.0 Copyright 2006, 2011 The Apache Software Foundation Copyright 2004, 2006 International Business Machines
Data Mining Analytics for Business Intelligence and Decision Support
Data Mining Analytics for Business Intelligence and Decision Support Chid Apte, T.J. Watson Research Center, IBM Research Division Knowledge Discovery and Data Mining (KDD) techniques are used for analyzing
www.progress.com DEPLOYMENT ARCHITECTURE FOR JAVA ENVIRONMENTS
DEPLOYMENT ARCHITECTURE FOR JAVA ENVIRONMENTS TABLE OF CONTENTS Introduction 1 Progress Corticon Product Architecture 1 Deployment Options 2 Invoking Corticon Decision Services 4 Corticon Rule Engine 5
A standards-based approach to application integration
A standards-based approach to application integration An introduction to IBM s WebSphere ESB product Jim MacNair Senior Consulting IT Specialist [email protected] Copyright IBM Corporation 2005. All rights
Service Oriented Architecture
Service Oriented Architecture Charlie Abela Department of Artificial Intelligence [email protected] Last Lecture Web Ontology Language Problems? CSA 3210 Service Oriented Architecture 2 Lecture Outline
Characteristics of Java (Optional) Y. Daniel Liang Supplement for Introduction to Java Programming
Characteristics of Java (Optional) Y. Daniel Liang Supplement for Introduction to Java Programming Java has become enormously popular. Java s rapid rise and wide acceptance can be traced to its design
Network Machine Learning Research Group. Intended status: Informational October 19, 2015 Expires: April 21, 2016
Network Machine Learning Research Group S. Jiang Internet-Draft Huawei Technologies Co., Ltd Intended status: Informational October 19, 2015 Expires: April 21, 2016 Abstract Network Machine Learning draft-jiang-nmlrg-network-machine-learning-00
Base One's Rich Client Architecture
Base One's Rich Client Architecture Base One provides a unique approach for developing Internet-enabled applications, combining both efficiency and ease of programming through its "Rich Client" architecture.
How To Build A Financial Messaging And Enterprise Service Bus (Esb)
Simplifying SWIFT Connectivity Introduction to Financial Messaging Services Bus A White Paper by Microsoft and SAGA Version 1.0 August 2009 Applies to: Financial Services Architecture BizTalk Server BizTalk
Distributed Computing and Big Data: Hadoop and MapReduce
Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:
Enhancing A Software Testing Tool to Validate the Web Services
Enhancing A Software Testing Tool to Validate the Web Services Tanuj Wala 1, Aman Kumar Sharma 2 1 Research Scholar, Department of Computer Science, Himachal Pradesh University Shimla, India 2 Associate
Big Data With Hadoop
With Saurabh Singh [email protected] The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
In-Database Analytics
Embedding Analytics in Decision Management Systems In-database analytics offer a powerful tool for embedding advanced analytics in a critical component of IT infrastructure. James Taylor CEO CONTENTS Introducing
Content Management Using Rational Unified Process Part 1: Content Management Defined
Content Management Using Rational Unified Process Part 1: Content Management Defined Introduction This paper presents an overview of content management, particularly as it relates to delivering content
Make Better Decisions Through Predictive Intelligence
IBM SPSS Modeler Professional Make Better Decisions Through Predictive Intelligence Highlights Easily access, prepare and model structured data with this intuitive, visual data mining workbench Rapidly
Knowledge Discovery from patents using KMX Text Analytics
Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs [email protected] Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers
Functional Requirements for Digital Asset Management Project version 3.0 11/30/2006
/30/2006 2 3 4 5 6 7 8 9 0 2 3 4 5 6 7 8 9 20 2 22 23 24 25 26 27 28 29 30 3 32 33 34 35 36 37 38 39 = required; 2 = optional; 3 = not required functional requirements Discovery tools available to end-users:
JOURNAL OF OBJECT TECHNOLOGY
JOURNAL OF OBJECT TECHNOLOGY Online at www.jot.fm. Published by ETH Zurich, Chair of Software Engineering JOT, 2008 Vol. 7, No. 8, November-December 2008 What s Your Information Agenda? Mahesh H. Dodani,
Tomáš Müller IT Architekt 21/04/2010 ČVUT FEL: SOA & Enterprise Service Bus. 2010 IBM Corporation
Tomáš Müller IT Architekt 21/04/2010 ČVUT FEL: SOA & Enterprise Service Bus Agenda BPM Follow-up SOA and ESB Introduction Key SOA Terms SOA Traps ESB Core functions Products and Standards Mediation Modules
Universal PMML Plug-in for EMC Greenplum Database
Universal PMML Plug-in for EMC Greenplum Database Delivering Massively Parallel Predictions Zementis, Inc. [email protected] USA: 6125 Cornerstone Court East, Suite #250, San Diego, CA 92121 T +1(619)
In: Proceedings of RECPAD 2002-12th Portuguese Conference on Pattern Recognition June 27th- 28th, 2002 Aveiro, Portugal
Paper Title: Generic Framework for Video Analysis Authors: Luís Filipe Tavares INESC Porto [email protected] Luís Teixeira INESC Porto, Universidade Católica Portuguesa [email protected] Luís Corte-Real
Business Rule Standards -- Interoperability and Portability
Rule Standards -- Interoperability and Portability April 2005 Mark H. Linehan Senior Technical Staff Member IBM Software Group Emerging Technology [email protected] Donald F. Ferguson IBM Fellow Software
SIPAC. Signals and Data Identification, Processing, Analysis, and Classification
SIPAC Signals and Data Identification, Processing, Analysis, and Classification Framework for Mass Data Processing with Modules for Data Storage, Production and Configuration SIPAC key features SIPAC is
CHAPTER 1 INTRODUCTION
1 CHAPTER 1 INTRODUCTION Exploration is a process of discovery. In the database exploration process, an analyst executes a sequence of transformations over a collection of data structures to discover useful
SIMPLE MACHINE HEURISTIC INTELLIGENT AGENT FRAMEWORK
SIMPLE MACHINE HEURISTIC INTELLIGENT AGENT FRAMEWORK Simple Machine Heuristic (SMH) Intelligent Agent (IA) Framework Tuesday, November 20, 2011 Randall Mora, David Harris, Wyn Hack Avum, Inc. Outline Solution
Extension of a SCA Editor and Deployment-Strategies for Software as a Service Applications
Institut fur Architektur von Anwendungssystemen Universität Stuttgart Universitätsstraße 38 70569 Stuttgart Diplomarbeit Nr. 2810 Extension of a SCA Editor and Deployment-Strategies for Software as a Service
MicroStrategy Course Catalog
MicroStrategy Course Catalog 1 microstrategy.com/education 3 MicroStrategy course matrix 4 MicroStrategy 9 8 MicroStrategy 10 table of contents MicroStrategy course matrix MICROSTRATEGY 9 MICROSTRATEGY
Survey of the Benchmark Systems and Testing Frameworks For Tachyon-Perf
Survey of the Benchmark Systems and Testing Frameworks For Tachyon-Perf Rong Gu,Qianhao Dong 2014/09/05 0. Introduction As we want to have a performance framework for Tachyon, we need to consider two aspects
MD Link Integration. 2013 2015 MDI Solutions Limited
MD Link Integration 2013 2015 MDI Solutions Limited Table of Contents THE MD LINK INTEGRATION STRATEGY...3 JAVA TECHNOLOGY FOR PORTABILITY, COMPATIBILITY AND SECURITY...3 LEVERAGE XML TECHNOLOGY FOR INDUSTRY
UIMA and WebContent: Complementary Frameworks for Building Semantic Web Applications
UIMA and WebContent: Complementary Frameworks for Building Semantic Web Applications Gaël de Chalendar CEA LIST F-92265 Fontenay aux Roses [email protected] 1 Introduction The main data sources
IBM WebSphere ILOG Rules for.net
Automate business decisions and accelerate time-to-market IBM WebSphere ILOG Rules for.net Business rule management for Microsoft.NET and SOA environments Highlights Complete BRMS for.net Integration with
Lesson 4 Web Service Interface Definition (Part I)
Lesson 4 Web Service Interface Definition (Part I) Service Oriented Architectures Module 1 - Basic technologies Unit 3 WSDL Ernesto Damiani Università di Milano Interface Definition Languages (1) IDLs
What is Analytic Infrastructure and Why Should You Care?
What is Analytic Infrastructure and Why Should You Care? Robert L Grossman University of Illinois at Chicago and Open Data Group [email protected] ABSTRACT We define analytic infrastructure to be the services,
Accessing Data with ADOBE FLEX 4.6
Accessing Data with ADOBE FLEX 4.6 Legal notices Legal notices For legal notices, see http://help.adobe.com/en_us/legalnotices/index.html. iii Contents Chapter 1: Accessing data services overview Data
Operationalise Predictive Analytics
Operationalise Predictive Analytics Publish SPSS, Excel and R reports online Predict online using SPSS and R models Access models and reports via Android app Organise people and content into projects Monitor
Data Mining Standards
Data Mining Standards Arati Kadav Jaya Kawale Pabitra Mitra [email protected] [email protected] [email protected] Abstract In this survey paper we have consolidated all the current data mining
How To Build A Connector On A Website (For A Nonprogrammer)
Index Data's MasterKey Connect Product Description MasterKey Connect is an innovative technology that makes it easy to automate access to services on the web. It allows nonprogrammers to create 'connectors'
MODEL DRIVEN DEVELOPMENT OF BUSINESS PROCESS MONITORING AND CONTROL SYSTEMS
MODEL DRIVEN DEVELOPMENT OF BUSINESS PROCESS MONITORING AND CONTROL SYSTEMS Tao Yu Department of Computer Science, University of California at Irvine, USA Email: [email protected] Jun-Jang Jeng IBM T.J. Watson
Pipeline Pilot Enterprise Server. Flexible Integration of Disparate Data and Applications. Capture and Deployment of Best Practices
overview Pipeline Pilot Enterprise Server Pipeline Pilot Enterprise Server (PPES) is a powerful client-server platform that streamlines the integration and analysis of the vast quantities of data flooding
Introduction to WebSphere Process Server and WebSphere Enterprise Service Bus
Introduction to WebSphere Process Server and WebSphere Enterprise Service Bus Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3 Unit objectives
Auto-Classification for Document Archiving and Records Declaration
Auto-Classification for Document Archiving and Records Declaration Josemina Magdalen, Architect, IBM November 15, 2013 Agenda IBM / ECM/ Content Classification for Document Archiving and Records Management
Content Management Using the Rational Unified Process By: Michael McIntosh
Content Management Using the Rational Unified Process By: Michael McIntosh Rational Software White Paper TP164 Table of Contents Introduction... 1 Content Management Overview... 1 The Challenge of Unstructured
Data Mining + Business Intelligence. Integration, Design and Implementation
Data Mining + Business Intelligence Integration, Design and Implementation ABOUT ME Vijay Kotu Data, Business, Technology, Statistics BUSINESS INTELLIGENCE - Result Making data accessible Wider distribution
Web Mining. Margherita Berardi LACAM. Dipartimento di Informatica Università degli Studi di Bari [email protected]
Web Mining Margherita Berardi LACAM Dipartimento di Informatica Università degli Studi di Bari [email protected] Bari, 24 Aprile 2003 Overview Introduction Knowledge discovery from text (Web Content
Database Marketing, Business Intelligence and Knowledge Discovery
Database Marketing, Business Intelligence and Knowledge Discovery Note: Using material from Tan / Steinbach / Kumar (2005) Introduction to Data Mining,, Addison Wesley; and Cios / Pedrycz / Swiniarski
Technical. Overview. ~ a ~ irods version 4.x
Technical Overview ~ a ~ irods version 4.x The integrated Ru e-oriented DATA System irods is open-source, data management software that lets users: access, manage, and share data across any type or number
Chapter 1: Introduction
Chapter 1: Introduction Database System Concepts, 5th Ed. See www.db book.com for conditions on re use Chapter 1: Introduction Purpose of Database Systems View of Data Database Languages Relational Databases
Azure Machine Learning, SQL Data Mining and R
Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:
Software Development Kit
Open EMS Suite by Nokia Software Development Kit Functional Overview Version 1.3 Nokia Siemens Networks 1 (21) Software Development Kit The information in this document is subject to change without notice
Software Architecture Document
Software Architecture Document Natural Language Processing Cell Version 1.0 Natural Language Processing Cell Software Architecture Document Version 1.0 1 1. Table of Contents 1. Table of Contents... 2
16.1 MAPREDUCE. For personal use only, not for distribution. 333
For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several
DIABLO VALLEY COLLEGE CATALOG 2014-2015
COMPUTER SCIENCE COMSC The computer science department offers courses in three general areas, each targeted to serve students with specific needs: 1. General education students seeking a computer literacy
SAS In-Database Processing
Technical Paper SAS In-Database Processing A Roadmap for Deeper Technical Integration with Database Management Systems Table of Contents Abstract... 1 Introduction... 1 Business Process Opportunities...
Web Analytics Understand your web visitors without web logs or page tags and keep all your data inside your firewall.
Web Analytics Understand your web visitors without web logs or page tags and keep all your data inside your firewall. 5401 Butler Street, Suite 200 Pittsburgh, PA 15201 +1 (412) 408 3167 www.metronomelabs.com
RS MDM. Integration Guide. Riversand
RS MDM 2009 Integration Guide This document provides the details about RS MDMCenter integration module and provides details about the overall architecture and principles of integration with the system.
01219211 Software Development Training Camp 1 (0-3) Prerequisite : 01204214 Program development skill enhancement camp, at least 48 person-hours.
(International Program) 01219141 Object-Oriented Modeling and Programming 3 (3-0) Object concepts, object-oriented design and analysis, object-oriented analysis relating to developing conceptual models
Dynamic Decision-Making Web Services Using SAS Stored Processes and SAS Business Rules Manager
Paper SAS1787-2015 Dynamic Decision-Making Web Services Using SAS Stored Processes and SAS Business Rules Manager Chris Upton and Lori Small, SAS Institute Inc. ABSTRACT With the latest release of SAS
Bringing Business Objects into ETL Technology
Bringing Business Objects into ETL Technology Jing Shan Ryan Wisnesky Phay Lau Eugene Kawamoto Huong Morris Sriram Srinivasn Hui Liao 1. Northeastern University, [email protected] 2. Stanford University,
IBM WebSphere Application Server Family
IBM IBM Family Providing the right application foundation to meet your business needs Highlights Build a strong foundation and reduce costs with the right application server for your business needs Increase
A Near Real-Time Personalization for ecommerce Platform Amit Rustagi [email protected]
A Near Real-Time Personalization for ecommerce Platform Amit Rustagi [email protected] Abstract. In today's competitive environment, you only have a few seconds to help site visitors understand that you
Knowledgent White Paper Series. Developing an MDM Strategy WHITE PAPER. Key Components for Success
Developing an MDM Strategy Key Components for Success WHITE PAPER Table of Contents Introduction... 2 Process Considerations... 3 Architecture Considerations... 5 Conclusion... 9 About Knowledgent... 10
irods and Metadata survey Version 0.1 Date March Abhijeet Kodgire [email protected] 25th
irods and Metadata survey Version 0.1 Date 25th March Purpose Survey of Status Complete Author Abhijeet Kodgire [email protected] Table of Contents 1 Abstract... 3 2 Categories and Subject Descriptors...
Extracting and Preparing Metadata to Make Video Files Searchable
Extracting and Preparing Metadata to Make Video Files Searchable Meeting the Unique File Format and Delivery Requirements of Content Aggregators and Distributors Table of Contents Executive Overview...
2010-2011 Assessment for Master s Degree Program Fall 2010 - Spring 2011 Computer Science Dept. Texas A&M University - Commerce
2010-2011 Assessment for Master s Degree Program Fall 2010 - Spring 2011 Computer Science Dept. Texas A&M University - Commerce Program Objective #1 (PO1):Students will be able to demonstrate a broad knowledge
Nine Common Types of Data Mining Techniques Used in Predictive Analytics
1 Nine Common Types of Data Mining Techniques Used in Predictive Analytics By Laura Patterson, President, VisionEdge Marketing Predictive analytics enable you to develop mathematical models to help better
2. Distributed Handwriting Recognition. Abstract. 1. Introduction
XPEN: An XML Based Format for Distributed Online Handwriting Recognition A.P.Lenaghan, R.R.Malyan, School of Computing and Information Systems, Kingston University, UK {a.lenaghan,r.malyan}@kingston.ac.uk
Outlines. Business Intelligence. What Is Business Intelligence? Data mining life cycle
Outlines Business Intelligence Lecture 15 Why integrate BI into your smart client application? Integrating Mining into your application Integrating into your application What Is Business Intelligence?
Web Document Clustering
Web Document Clustering Lab Project based on the MDL clustering suite http://www.cs.ccsu.edu/~markov/mdlclustering/ Zdravko Markov Computer Science Department Central Connecticut State University New Britain,
Rules and Business Rules
OCEB White Paper on Business Rules, Decisions, and PRR Version 1.1, December 2008 Paul Vincent, co-chair OMG PRR FTF TIBCO Software Abstract The Object Management Group s work on standards for business
Kofax Transformation Modules Generic Versus Specific Online Learning
Kofax Transformation Modules Generic Versus Specific Online Learning Date June 27, 2011 Applies To Kofax Transformation Modules 3.5, 4.0, 4.5, 5.0 Summary This application note provides information about
A Next-Generation Analytics Ecosystem for Big Data. Colin White, BI Research September 2012 Sponsored by ParAccel
A Next-Generation Analytics Ecosystem for Big Data Colin White, BI Research September 2012 Sponsored by ParAccel BIG DATA IS BIG NEWS The value of big data lies in the business analytics that can be generated
Integrating VoltDB with Hadoop
The NewSQL database you ll never outgrow Integrating with Hadoop Hadoop is an open source framework for managing and manipulating massive volumes of data. is an database for handling high velocity data.
ORACLE DATABASE 10G ENTERPRISE EDITION
ORACLE DATABASE 10G ENTERPRISE EDITION OVERVIEW Oracle Database 10g Enterprise Edition is ideal for enterprises that ENTERPRISE EDITION For enterprises of any size For databases up to 8 Exabytes in size.
Computer Network. Interconnected collection of autonomous computers that are able to exchange information
Introduction Computer Network. Interconnected collection of autonomous computers that are able to exchange information No master/slave relationship between the computers in the network Data Communications.
