A Technical Review of TIBCO Patterns Search
TABLE OF CONTENTS
Summary
Architectural Overview
How Does TIBCO Patterns Search Work?
Eliminate the Need for Rules
Loading and Synchronizing Data
Summary

Since man first tried to decipher poorly formed characters on papyrus, humans have used their innate ability to see past errors and inconsistencies in data and recognize the underlying similarities. In more recent history, a number of automated techniques have been developed to deal with poor-quality data. Each technique has problems and limitations compared to the way humans work, and most work only on names. They also require a significant amount of computing resources and often make errors. Unfortunately, as enterprise and agency databases continue to grow, organizations have had little choice but to depend on these inadequate approaches.

TIBCO has approached the problem from a different angle: mathematical pattern recognition that does not need to know the semantic or phonetic representation of data. The TIBCO Patterns algorithms look at the position of characters and groups of characters (tokens) and their positional relationship to other data. While there have been several similar attempts over the years, they failed because of the immense computational demands this approach normally entails. In other words, they were impractical for today's huge data sets and sub-second performance requirements. The TIBCO patented approach scales to provide real-time responsiveness on very large databases about many different types of entities, in any language, with no specialized rules. As a result, it is now possible to deploy a powerful matching technology more rapidly, one that delivers significantly higher accuracy than other available methods.

Architectural Overview

TIBCO Patterns Search is an in-memory database search system that can be attached to virtually any data source, including Oracle, SQL Server, DB2, and MySQL. The search functions are integrated into applications via standard APIs using any common programming language, including Java, .NET, Python, and C/C++.
Native integration with TIBCO BusinessWorks, TIBCO BusinessEvents, and TIBCO ActiveMatrix is also provided.
The engine is natively supported on Linux, various UNIX platforms, and Windows, all on 32- and 64-bit processors. It can sustain real-time, highly accurate search for small, medium, large, and extra-large databases. The engine's architecture allows requests to be load-balanced across multiple instances and data to be partitioned, so that databases of any size can be handled with sub-second latency. Multi-threaded, federated queries are possible, enabling you to take advantage of a wide range of server environments, data schemas, and business application needs. Because the engine is not limited by data volume or throughput, commodity hardware is sufficient. One of our largest implementations holds 700 million records and processes 25 queries per second around the clock on a relatively modest blade server infrastructure. The highly compact engine is contained within a single executable of about 1 MB, which makes deployment easy on a platform of any size.

Try the Live Demos

To see how this works, please visit www.netrics.com/demo. While viewing, note that the demo data has had no preprocessing, cleaning, matching rules, scrubbing, or normalizing whatsoever. The demos run on one commodity two-socket Dell PowerEdge server with Intel CPUs running Red Hat Linux.
How Does TIBCO Patterns Search Work?

The search engine uses advanced mathematical modeling and bipartite graph-based algorithms to calculate similarity scores. The clever (patented) part is how it processes an extremely large number of records in a very short amount of time, all on standard hardware. The result is a powerful matching engine that can distinguish between patterns of data in ways that strict SQL-type search and other types of matching solutions cannot.

The engine is completely agnostic as to the type of data, domain, and language. It makes no assumptions about whether your data is names and addresses, products, medical records, or double-byte characters representing supplier names, and languages can be intermixed. Its cultural and domain independence allows you to deploy the engine within hours, without prior knowledge of the type, structure, or state of your data. This means that you don't have to build rules, perform data profiling, or normalize in order to find and capture meaningful information from your data sources. Connecting the engine to your existing applications requires as few as 25 to 30 lines of code.

Sample Java Implementation Code

The following is a sample matching request in Java. It defines the connection to the matching engine, matches data (defined by the query) within and across field boundaries (defined by the field names), and returns a Java object result set for interpretation.

    import java.io.IOException;

    import com.tibco.likeit.TIBCOException;
    import com.tibco.likeit.TIBCOQuery;
    import com.tibco.likeit.TIBCOSearchCfg;
    import com.tibco.likeit.TIBCOSearchOpts;
    import com.tibco.likeit.TIBCOSearchResponse;
    import com.tibco.likeit.TIBCOSearchResult;
    import com.tibco.likeit.TIBCOServerInterface;

    public class test {
        // Host, port, table, and query text are hard-coded below for illustration.
        public static String query(String host, String port, String table, String query)
                throws IOException, TIBCOException {
            // Define the connection to the engine
            TIBCOServerInterface si = new TIBCOServerInterface("localhost", 5051, false, false);
            TIBCOSearchResponse resp = null;
            String[] fieldNames = { "first", "middle", "last", "street", "city", "zip", "state" };
            TIBCOSearchOpts opts = new TIBCOSearchOpts();
            TIBCOSearchCfg[] tblcfgs = new TIBCOSearchCfg[1];
            tblcfgs[0] = new TIBCOSearchCfg("people_table");
            // Define a cross-field query of the query string against all fields in the table
            tblcfgs[0].setTIBCOQuery(
                    TIBCOQuery.Simple("jasoz fitgerlad klassen st paul mn 551", fieldNames, null));
            resp = si.search(tblcfgs, opts); // perform the query

            // Extract the result set as a string: one line per record, score first
            StringBuilder s = new StringBuilder();
            TIBCOSearchResult[] res = resp.getSearchResults();
            for (int i = 0; i < res.length; i++) {
                s.append(Double.toString(res[i].getMatchScore()));
                for (int j = 0; j < res[i].getFields().length; j++) {
                    s.append(Integer.toString(i)).append(":").append(res[i].getFields()[j]);
                }
                s.append("\n");
            }
            return s.toString();
        }
    }
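The engine's actual scoring algorithms are patented and are not described in this document. For rough intuition about what matching tokens across a bipartite graph can look like, the following is a minimal sketch under stated assumptions: it uses plain Levenshtein edit distance as the token similarity and a greedy pairing of query tokens to record tokens. Both choices, and every class and method name, are hypothetical and chosen for brevity; this is not the TIBCO implementation and it makes no attempt at the engine's performance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: a generic token-level similarity, NOT the TIBCO algorithm. */
final class TokenMatchSketch {

    /** Classic Levenshtein edit distance, computed with two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity in [0, 1]; 1.0 means the tokens are identical. */
    static double tokenSimilarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) editDistance(a, b) / longest;
    }

    /**
     * Scores a query against a record. Query tokens and record tokens form the two
     * sides of a bipartite graph; each query token is greedily paired with its most
     * similar unused record token, and the pair similarities are averaged.
     * Greedy pairing is an assumption made for brevity, not an optimal assignment.
     */
    static double score(String query, String record) {
        String[] queryTokens = query.trim().toLowerCase().split("\\s+");
        List<String> remaining =
                new ArrayList<>(Arrays.asList(record.trim().toLowerCase().split("\\s+")));
        double total = 0.0;
        for (String token : queryTokens) {
            int best = -1;
            double bestSim = 0.0;
            for (int k = 0; k < remaining.size(); k++) {
                double sim = tokenSimilarity(token, remaining.get(k));
                if (sim > bestSim) {
                    bestSim = sim;
                    best = k;
                }
            }
            if (best >= 0) {
                remaining.remove(best); // each record token may be used once
            }
            total += bestSim;
        }
        return total / queryTokens.length;
    }
}
```

Because tokens are paired by similarity instead of position, `TokenMatchSketch.score("jasoz fitgerlad", "jason fitzgerald")` ranks well above an unrelated record even though neither token matches exactly, and reordering the tokens of a record does not change its score.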
Eliminate the Need for Rules

With TIBCO Patterns Search, matching rules are not required. When you need more control, the engine supports cross-token, cross-field matching, back and forth across field boundaries in any way you choose. You also have fine-grained control over which fields are used for which part of the query, including how they are combined and how they relate to each other. You can also define the sensitivity and weighting of individual fields or tokens, along with many other parameters that control the matching process. For most applications, very little tuning is required to obtain extremely accurate results.

For every query, the engine returns a result set. Each result is ranked and scored according to its similarity to the search text. The engine delivers not only a per-record score across all fields; it can also provide individual scoring at the field and character level. A standard feature embeds HTML in the result record data to visually highlight which portions of the data records, at the field and character level, contributed to the match and to what degree.

Typical Deployment

TIBCO Patterns Search is invoked through an API that internally uses a TCP socket interface, enabling horizontal scalability as well as flexible load-balancing and failover options. Client libraries are provided, which allow the application to access the full functionality of the engine.
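The failover option mentioned above can be sketched on the client side as follows. This is a minimal illustration, assuming the application holds a list of engine endpoints and a function that runs a search against one of them; the `host:port` endpoint strings, the class name, and the search function are all hypothetical and are not part of the TIBCO client libraries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Illustrative client-side failover; endpoints and the search function are hypothetical. */
final class FailoverSketch {

    /**
     * Tries each engine endpoint in order and returns the first successful response.
     * If every endpoint fails, throws IllegalStateException carrying each failure
     * as a suppressed exception.
     */
    static <R> R firstAvailable(List<String> endpoints, Function<String, R> search) {
        List<RuntimeException> failures = new ArrayList<>();
        for (String endpoint : endpoints) {
            try {
                return search.apply(endpoint);
            } catch (RuntimeException e) {
                failures.add(e); // remember the failure and move on to the next instance
            }
        }
        IllegalStateException all = new IllegalStateException("no engine instance reachable");
        failures.forEach(all::addSuppressed);
        throw all;
    }
}
```

Rotating the endpoint list between calls would turn the same loop into a simple round-robin load balancer; a production deployment would more likely delegate both concerns to a dedicated load balancer in front of the engine instances.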
Loading and Synchronizing Data

1. Initial loading of the data into the engine
2. Synchronizing updates to the data with the engine index in near-real time

Loading Data

Static or Dynamic Data Source

In the static case, the data is loaded after a batch update to the data source (for example, after the nightly update of product information). This is typically implemented by a cursor that iterates through the source table in the RDBMS and, for each record or set of records, invokes the TIBCO API to insert the records into the TIBCO Patterns Search table.

In some situations, an initial load of the data has to be performed from an RDBMS while it is undergoing live changes. The challenge is to ensure that a consistent set of data is loaded and that the load also provides a well-defined entry point (timestamp) for the ongoing dynamic updates. The dynamic update then processes all changes from that timestamp forward.

Synchronizing with Updates to the Underlying Tables
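The hand-off between the initial load and the ongoing updates described above can be sketched as follows. This is a minimal in-memory illustration, assuming each source change carries a timestamp and a record key: a map stands in for the engine's search table, and a plain iteration stands in for the RDBMS cursor. All class, field, and method names are hypothetical; a real implementation would call the TIBCO client API where this sketch writes to the map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative snapshot load followed by timestamp-based replay of changes. */
final class LoadAndSyncSketch {

    /** A change captured from the source system; a null value means "delete". */
    static final class Change {
        final long timestamp;
        final String key;
        final String value;

        Change(long timestamp, String key, String value) {
            this.timestamp = timestamp;
            this.key = key;
            this.value = value;
        }
    }

    /** Stand-in for the search table held by the engine. */
    final Map<String, String> searchTable = new LinkedHashMap<>();

    /** Timestamp up to which the search table reflects the source. */
    long highWaterMark = Long.MIN_VALUE;

    /** Initial load: insert the snapshot in batches, then record its timestamp. */
    void initialLoad(Map<String, String> snapshot, long snapshotTimestamp, int batchSize) {
        List<Map.Entry<String, String>> batch = new ArrayList<>();
        for (Map.Entry<String, String> row : snapshot.entrySet()) { // plays the cursor's role
            batch.add(row);
            if (batch.size() == batchSize) {
                insertBatch(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            insertBatch(batch);
        }
        highWaterMark = snapshotTimestamp; // entry point for the dynamic updates
    }

    private void insertBatch(List<Map.Entry<String, String>> batch) {
        for (Map.Entry<String, String> row : batch) {
            searchTable.put(row.getKey(), row.getValue());
        }
    }

    /** Dynamic update: apply, in timestamp order, only changes newer than the mark. */
    void applyChanges(List<Change> changes) {
        List<Change> ordered = new ArrayList<>(changes);
        ordered.sort((a, b) -> Long.compare(a.timestamp, b.timestamp));
        for (Change c : ordered) {
            if (c.timestamp <= highWaterMark) {
                continue; // already reflected in the loaded snapshot
            }
            if (c.value == null) {
                searchTable.remove(c.key);
            } else {
                searchTable.put(c.key, c.value);
            }
            highWaterMark = c.timestamp;
        }
    }
}
```

Recording the snapshot timestamp only after the last batch is inserted is the design choice that matters here: changes at or before that timestamp are skipped as already loaded, and everything later is replayed exactly once.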
TIBCO Software Inc. (NASDAQ: TIBX) technology digitized Wall Street in the '80s with its event-driven Information Bus software, which helped make real-time business a strategic differentiator in the '90s. Today, TIBCO's infrastructure software gives customers the ability to constantly innovate by connecting applications and data in a service-oriented architecture, streamlining activities through business process management, and giving people the information and intelligence tools they need to make faster and smarter decisions, what we call The Power of Now. TIBCO serves more than 4,000 customers around the world with offices in more than 20 countries and an ecosystem of over 200 partners. Learn more at www.tibco.com.

Global Headquarters
3303 Hillview Avenue
Palo Alto, CA 94304
Tel: +1 650-846-1000, +1 800-420-8450
Fax: +1 650-846-1005
www.tibco.com

© 2010, TIBCO Software Inc. All rights reserved. TIBCO, the TIBCO logo, The Power of Now, and TIBCO Software are trademarks or registered trademarks of TIBCO Software Inc. in the United States and/or other countries. All other product and company names and marks mentioned in this document are the property of their respective owners and are mentioned for identification purposes only.