Applying machine learning to integrating Big Data

Publication Date: Sept. 2014
Product code: IT0014-002934
Tony Baer

Summary

Catalyst

Traditional data integration approaches may not scale for Big Data. The new norm is that, as data volumes grow, so do the number and diversity of data sources. Tamr is applying innovations in machine learning to automate and scale the process of integrating and reconciling data from multiple sources. It complements, rather than replaces, tools and practices for data transformation and master data management.

Key messages

- Data integration challenges are magnified by Big Data.
- Tamr leverages machine learning, an increasingly popular approach for data transformation, and applies it at the upstream step of reconciling and integrating data from multiple sources.
- Tamr is carving a foothold in a new portion of an emerging market for Big Data integration, an area where there will be significant opportunity for OEM and "coopetition" strategies.

Ovum view

Tamr is carving a new foothold in the Big Data end of an established market. While a number of data preparation tools leveraging machine learning are emerging, Tamr is unique in applying similar approaches to the upstream step of consolidating data. It will complement data preparation and master data management tools in deriving the big picture from Big Data.

Recommendations for enterprises

Why put Tamr on your radar?

Data integration, a perennial challenge for data warehousing, is magnified with Big Data. Not only are the data sets larger but, in all likelihood, so are the number and variety of sources. Traditional data integration approaches will not scale because of the sheer number, variety, and increasingly dynamic nature of data sources. With machine learning already transforming the data transformation process, Tamr is applying similar approaches to make the inevitable task of data integration and consolidation feasible for Big Data.

Highlights

Background

Data management issues associated with data warehousing are compounded by Big Data.
Beyond the issue of scale is the likelihood that more data sets will be involved, many of them from external sources where the provenance of the data is less well known. And, unlike traditional data warehousing, which used relatively static internal data sets, Big Data analytics are likely to consume
data sets where the content and structure of data are constantly morphing.

The Ovum report, Data Quality and Big Data: From Discovery to Precision, called for new approaches to be applied to data cleansing. Some of our recommendations included:

- determining whether the goal is getting "the big picture" (which does not require as rigorous a strategy for cleansing data) or "the exact picture"
- assigning confidence levels regarding data validity because, unlike with internal sources, it is virtually impossible to be 100% certain regarding data quality or consistency
- leveraging new approaches such as crowdsourcing, machine learning, and data science techniques to vouch for data.

Tamr applies many of these approaches to a similar task that is conducted upstream of data cleansing: integrating and consolidating data from multiple sources. It characterizes its approach as "data curation at scale." As a technology that employs probabilistic matching, it is best suited for use cases aimed at deriving the big picture. During data ingestion, Tamr extracts whatever metadata exists, makes rough guesses at which columns from the different data sources match, and displays histograms showing the relative level of certainty for each match. A workflow manager is available for organizing human expert input to help refine the matching logic; the same process is then repeated for matching individual records. The machine learning, refined by human input, steadily improves the matching over time.

Current position

The company was founded by the same team, Andy Palmer and Michael Stonebraker, who previously started Vertica. In May 2014, the company released the first version of its product and received $16m in Series A funding from Google Ventures and New Enterprise Associates. Tamr is part of a wave of data management start-ups filling the vacuum in the Big Data third-party tooling ecosystem.
This is an essential development for making Big Data and platforms such as Hadoop accessible to the mainstream enterprise market, much as it was for data warehousing and business intelligence nearly 20 years ago.

For data integration-related tasks, much of the start-up activity has leaned heavily on machine learning; it is useful not only because of the scale of data involved, but also for helping overcome uncertainty. Start-ups such as Trifacta and Paxata emerged applying such techniques to data preparation, an approach subsequently embraced by incumbents Informatica and IBM. Tamr has adopted a similar approach but applied it to a different, upstream problem: curating data from multiple sources. It has identified use cases with customer data, product parts catalogs, and health claims reconciliation, among others.

Tamr's opportunity and challenge lie in being one of the first to make a stab at the data integration stage of the process. IBM has publicly stated its intention to develop a "Big Match" capability for Big Data that would complement its MDM (master data management) tools, and Ovum expects more players to surface. The most promising initial use case is reconciling identities from internal customer relationship management (CRM) and related systems with social data feeds, as Ovum has found customer-related applications to be among the most popular with early Big Data adopters. There are further
opportunities for Tamr to integrate with data preparation tools applying similar approaches. Ultimately, Ovum believes that data curation and data preparation should be integrated as a single workflow.

Data sheet

Key facts

Table 1: Data sheet: Tamr

Product name              Tamr
Product classification    Data integration
Version number            1.0
Release date              May 2014
Industries covered        All
Geographies covered       North America
Relevant company sizes    Midsized to large
Licensing options         Subscription
URL                       www.tamr.com
Routes to market          Direct
Company headquarters      Cambridge, Massachusetts, US
Number of employees       25

Source: Ovum

Appendix

On the Radar

On the Radar is a series of research notes about vendors bringing innovative ideas, products, or business models to their markets. Although On the Radar vendors may not be ready for prime time, they bear watching for their potential impact on markets and could be suitable for certain enterprise and public sector IT organizations.

Further reading

Data Quality and Big Data: From Discovery to Precision, IT014-002596 (May 2012)

Author

Tony Baer, Principal Analyst, Software Information Management

Ovum Consulting

We hope that this analysis will help you make informed and imaginative business decisions. If you have further requirements, Ovum's consulting team may be able to help you. For more information about Ovum's consulting capabilities, please contact your Ovum representative.

Copyright notice and disclaimer

The contents of this product are protected by international copyright laws, database rights and other intellectual property rights. The owner of these rights is Informa Telecoms and Media Limited, our affiliates or other third party licensors. All product and company names and logos contained within or appearing on this product are the trademarks, service marks or trading names of their respective owners, including Informa Telecoms and Media Limited. This product may not be copied, reproduced, distributed or transmitted in any form or by any means without the prior permission of Informa Telecoms and Media Limited.

Whilst reasonable efforts have been made to ensure that the information and content of this product was correct as at the date of first publication, neither Informa Telecoms and Media Limited nor any person engaged or employed by Informa Telecoms and Media Limited accepts any liability for any errors, omissions or other inaccuracies. Readers should independently verify any facts and figures as no liability can be accepted in this regard; readers assume full responsibility and risk accordingly for their use of such information and content.

Any views and/or opinions expressed in this product by individual authors or contributors are their personal views and/or opinions and do not necessarily reflect the views and/or opinions of Informa Telecoms and Media Limited.

© 2014 Ovum. All rights reserved. Unauthorized reproduction prohibited.

CONTACT US
www.ovum.com
(212) 652-2647

INTERNATIONAL OFFICES
Beijing, Dubai, Hong Kong, Hyderabad, Johannesburg, London, Melbourne, New York, San Francisco, Sao Paulo, Tokyo