Data Preparation at Forefront of Effective Analytics


Confronted with the onslaught of big data, business intelligence analysts are finding a fast friend in self-service data preparation software.

EDITOR'S NOTE

A Matter of Preparation

You can't analyze data if it's not integrated and formatted properly, and that's assuming you can even find the information you need. Sounds simple enough to rectify, right? Just get the data, and get it in order. Yet those were the two biggest challenges on predictive analytics initiatives cited by respondents to a 2015 Ventana Research survey. Forty percent said preparing data for analysis was the top hurdle in their organizations, while 22% pointed to problems in accessing data.

Big data environments have further complicated things, according to Forrester Research analyst Michele Goetz. In a September 2015 blog post, she wrote that data scientists and other analysts typically have had to do a significant amount of manual data wrangling as part of big data analytics applications. That's partly because data integration and management processes have been oriented primarily toward maintaining the system-of-record status of corporate data warehouses for standardized reporting, Goetz noted. But she said new self-service data preparation tools offer a possible way to improve the lot of analysts.

This handbook focuses on self-service software, which is designed to help data analysts integrate information from various source systems and prep it for analysis. First, consultant David Loshin shares his thoughts on how data preparation tools can contribute to more flexible analytics efforts. Next, News and Site Editor Jack Vaughan looks at self-service data preparation technologies at two organizations and the software's machine learning attributes. Vaughan closes with a report on the associated concept of data curation and other alternatives to the monolithic schemas of traditional data warehouses.

Craig Stedman
Executive Editor, SearchDataManagement

ANALYTICS

Business Analysts Empowered to Integrate and Prepare Data

A growing number of business analysts are sharpening their skills in writing ad hoc queries and analytical algorithms to uncover useful information in corporate data stores and help their organizations become more data-driven in making business decisions. Yet as these workers become more sophisticated in their use of analytics tools, many find that conventional data warehouse architectures impede their ability to analyze data for three core reasons.

First, the traditional data warehouse typically is a repository for data sets that have been extracted from internal transaction processing or operational systems for use in reporting on business performance. That limits the scope and types of analyses performed against the data.

Second, the extracted data sets are integrated and standardized using a monolithic set of business rules to align with a predefined data model designed for dimensional slicing and dicing. Doing so filters out information that may be relevant to particular analytics applications.

Third, the IT group is usually responsible for developing the rules and processes for transforming the data going into a data warehouse. This approach similarly may not meet the information needs of the analysts who are ultimately expected to use that data.

Obviously, conventional data warehousing processes can work for companies, but the data landscape is rapidly changing. Instead of pulling data from internal sources only, organizations increasingly are looking to blend their own transaction data with information coming from a variety of other sources. They include website clickstream and activity logs, sensors on manufacturing equipment and other devices, customer email, social networks and streaming data feeds from customers, data aggregators and third-party information services providers.

BEYOND THE DATA WAREHOUSE

Exploiting these often external data sources can boost efforts to generate actionable intelligence that, when paired with changes in business processes, makes a company truly data-driven. In many cases, though, the added data is better suited to being processed and stored in a big data platform (possibly a Hadoop cluster, NoSQL database or Spark system) than in a data warehouse. Or it may be accessible through an external Web portal.

In addition, business analysts, as well as data scientists and other analytics professionals, often want to access different combinations of the available data, sometimes in raw form. For example, the marketing team at a consumer products maker may want to analyze a mix of customer profile records, news feeds and social media data to look for patterns that can help them plan and launch an online marketing campaign. Meanwhile, the customer experience team may want to monitor social media feeds and product reviews from a variety of websites to identify potential product issues so it can take action to placate dissatisfied customers. And so on for other departments.

Because each department has different requirements and goals, it's virtually impossible for a homogenized data warehouse to meet an entire company's analytics objectives. Empowering analysts to work with the data that best meets their individual needs can be more fruitful. This approach has implications for various aspects of data integration, including data discovery, ingestion, profiling, validation and quality. But an emerging class of self-service data preparation tools offers a potential helping hand, enabling analysts in an organization's business operations to carry out key pieces of the integration process themselves.

A LOGICAL SEPARATION

The new technologies create a sensible segregation of duties among analytics users and IT and data management teams. Business analysts and data scientists can use the data preparation tools to find relevant data in different systems, pull it together, profile and cleanse the data for consistency and define the business rules that govern their use of the information. By using the software, they're able to get more comprehensive and customized views of the data that interests them than they typically get from a traditional data warehouse.

Ideally, the analysts also become more accountable for how the data is used. That means they should be tasked with understanding and adhering to high-level governance policies on data consumption and collaborating with others to ensure that the data, and how it's interpreted, remains consistent across the enterprise.

Because data sets are being captured and maintained in their original formats, the IT department is freed from having to implement integration and transformation rules dictating what data is available for analysis. Instead, IT's responsibility transitions to managing the overall infrastructure supporting data discovery, integration and analysis and providing control mechanisms to monitor for inconsistencies in data definitions and noncompliance with defined governance directives on using business data.

Data warehouses likely aren't going away in most organizations that have deployed them. And self-service data preparation software is a relatively new and still-maturing technology, sold primarily by startup vendors. But the blossoming of the self-service tools points the way to greater analytical flexibility and effectiveness in companies looking to get more out of their data.

David Loshin
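
As a rough illustration of the analyst-side steps described in this article (profiling a data set, cleansing it for consistency and applying a business rule while the source data stays in its original form), here is a minimal Python sketch. It is not taken from any of the products mentioned; the sample records, the `STATE_CODES` rule and the `profile` and `cleanse` functions are all hypothetical names chosen for the example.

```python
from collections import Counter

# Hypothetical sample records, standing in for data pulled from a source system.
crm_rows = [
    {"customer_id": "C001", "state": "ca", "email": "ann@example.com"},
    {"customer_id": "C002", "state": "California", "email": ""},
    {"customer_id": "C003", "state": " NY ", "email": "raj@example.com"},
]

# Illustrative business rule: map state spellings to one canonical code.
STATE_CODES = {"ca": "CA", "california": "CA", "ny": "NY", "new york": "NY"}


def profile(rows, column):
    """Summarize one column: row count, missing values and distinct values."""
    values = [(row.get(column) or "").strip() for row in rows]
    present = [v for v in values if v]
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": dict(Counter(present)),
    }


def cleanse(rows):
    """Return cleaned copies of the rows; the originals stay in raw form."""
    cleaned = []
    for row in rows:
        state = row["state"].strip().lower()
        cleaned.append({
            "customer_id": row["customer_id"].strip(),
            "state": STATE_CODES.get(state, state.upper()),
            "email": row["email"].strip() or None,
        })
    return cleaned


before = profile(crm_rows, "state")
after = profile(cleanse(crm_rows), "state")
print(before["distinct"])  # three spellings across three rows
print(after["distinct"])   # two canonical codes
```

Profiling before and after cleansing gives the analyst a quick check that the rule did what was intended, and keeping the raw rows untouched mirrors the point above that data is maintained in its original format.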

MANAGEMENT

Self-Service Data Prep Courts Machine Learning

Interest in machine learning technology often revolves around its capacity to automate and improve analytical predictions. But the technology has other uses, too. In one emerging example, machine learning underlies data management advancements in the zone between IT-based data developers and analysts working in business units, powering a new category of self-service data preparation tools.

Such tools can search for and access data throughout an organization, combine it with other data sets and perform format conversions as needed before feeding the integrated data into back-end business intelligence systems for analysis. Software vendors assert that the machine learning techniques built into the tools enable them to learn as they go and improve integration performance through continued use.

Machine learning itself may not be the first concern for business users trying to exploit their enterprise data for use in analytics applications. Still, they approve of the results produced by the new tools, which make it possible for useful sets of queries and data management task sequences to be saved and reused.

"I don't see any machine learning happening per se," said Kunal Patel, a data analyst at Inflection.com Inc., a maker of identity management, recruiting and public records search software in Redwood City, Calif. But Patel, whose group is using a data preparation platform from Alation Inc., said the software makes real-time suggestions concerning tactics for mixing data streams. The suggestions can be based on scenarios he has implemented before or on similar jobs that colleagues have run, evidence of the machine learning functionality in action.

Patel said Alation software users at Inflection can sign in and search for existing queries they can use to do business analytics. Users can also make copies of query sets they or others have created, without calling on the IT department for development help.

"We now have the foundations for nontechnical people to get up to speed," he said. "We're already seeing more engagement from, for example, product managers, ones who just don't want to spend from four to eight hours a day writing queries."

DATA, PREPARE THYSELF

Along with Alation, startups like Alteryx, Paxata, Tamr and Trifacta are pursuing self-service data preparation with puzzle pieces or full offerings. More established companies such as IBM, Informatica, Progress Software and Salesforce are entering the fray, too.

Alation CEO Satyen Sangani said the company's software is designed to help users increase their data literacy by capturing information about things such as who is using a particular table or query and how often. Sangani doesn't emphasize the machine learning technology that underpins the knowledge capture process. Instead, it's something under the hood, in the way a transmission is under the hood of a car, he said.

Self-service data prep is becoming central to big data analytics applications, according to Nenshad Bardoliwalla, co-founder and vice president of products at Paxata. "The most difficult part of the analytics process is pulling a lot of data from a lot of different sources," said Bardoliwalla, whose company's data preparation software employs semantic indexing, Spark machine learning and the Hadoop Distributed File System to help users handle large and diverse data workloads.

DEL MONTE MIGRATES TO NEW PROCESS

Another benefit of self-service data preparation is that IT resources are freed up to focus on other tasks, said Matthew Heinze, who heads BI at Del Monte Foods Co. in Walnut Creek, Calif. He and his team deployed Paxata's platform as part of a companywide migration to a cloud architecture. A limited transaction set involved critical SAP general ledger data and product-shipment information.
For reporting purposes, that needed to be blended with other types of data. The data preparation tools helped the BI team streamline the process of creating the reports and make it easier for business users to integrate data themselves.

"Before," Heinze said, "people would take SAP data and do offline integration with, say, Nielsen point-of-sale data, using Excel models. But that didn't run very well." Now, the two data sets are sent to a cloud-based Paxata implementation, where the vendor's specialized machine learning algorithms can be applied.

Each step in an integration routine is saved in the Paxata environment, which helps you form repeatable integrations that can be consumed by the reporting platform, Heinze said. He added that business users can load data, put integrations together themselves and see the effects of what they've done immediately, without taking up IT staff time.

DATA PREP MEETS CLOUD, DATA LAKE

The move toward cloud applications with new data integration requirements and the growing need to navigate Hadoop data lakes filled with a wide variety of data are propelling the interest in self-service data preparation, according to Philip Howard, an analyst at London-based Bloor Research International.

"With the advent of the data lake, IT is no longer the data gatekeeper," Howard said. "You have a data lake, and you want to explore it. But the issue for analysts has remained the same over time: that is, how you get access to that information." Meanwhile, he added, people increasingly want to take application data that is in the cloud and bring it together with in-house data, without a long wait for an IT project to do the integration work.

And Howard thinks many business users are ready to take on data preparation chores.
"Most of the vendors are addressing the needs of people who are reasonably tech-savvy," he explained, although he noted that users who aren't data scientists probably don't care much that the new breed of data prep tools is driven by machine learning technology. "If the software running inside has some smarts, if it can give you recommendations based on what you or others have done, that's what is useful," Howard said.

Jack Vaughan
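
The idea of saving each step of an integration routine so that the whole blend can be replayed, as in the Del Monte account above, can be pictured with a short Python sketch. This is only an illustration under stated assumptions: it does not use or represent Paxata's software, and the field names, sample figures, step functions and the `sell_through` measure are all hypothetical.

```python
# Hypothetical extracts: shipment records from an ERP system and
# point-of-sale records from a market data provider.
shipments = [
    {"sku": "1001", "units_shipped": 500},
    {"sku": "1002", "units_shipped": 300},
]
pos_sales = [
    {"SKU": "1001 ", "units_sold": "420"},
    {"SKU": "1003", "units_sold": "75"},
]


def normalize_pos(shipped, pos):
    """Align the point-of-sale field names and types with the shipment data."""
    fixed = [{"sku": r["SKU"].strip(), "units_sold": int(r["units_sold"])} for r in pos]
    return shipped, fixed


def blend_on_sku(shipped, pos):
    """Join the two sets on SKU, keeping only products present in both."""
    sold = {r["sku"]: r["units_sold"] for r in pos}
    joined = [dict(r, units_sold=sold[r["sku"]]) for r in shipped if r["sku"] in sold]
    return joined, pos


def add_sell_through(blended, pos):
    """Derive a reporting measure: share of shipped units sold at retail."""
    for r in blended:
        r["sell_through"] = round(r["units_sold"] / r["units_shipped"], 2)
    return blended, pos


# The saved routine: an ordered list of steps that can be replayed as-is.
ROUTINE = [normalize_pos, blend_on_sku, add_sell_through]


def run(routine, shipped, pos):
    """Apply each saved step in order and return the blended rows."""
    for step in routine:
        shipped, pos = step(shipped, pos)
    return shipped


report = run(ROUTINE, shipments, pos_sales)
print(report)
```

Because the routine is just an ordered list of steps and the input rows are never modified, running it again on the next data delivery produces the same kind of output without anyone rebuilding the logic by hand, which is the contrast with the one-off spreadsheet models described above.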

INTEGRATION

Monolithic Data Models Eat Dust

Data integration must take a new, more curatorial tack in the age of big data, according to Michael Stonebraker, a database development pioneer, MIT professor and serial entrepreneur whose latest venture is integration software startup Tamr Inc. But other participants at a data management conference held at MIT in July of last year contended that new integration methods are already in play.

Stonebraker, speaking at the 2015 MIT Chief Data Officer & Information Quality Symposium, called for the use of emerging data curation tools backed by machine learning algorithms (for example, Tamr's technology) to help meet the rise in data volume and variety that many businesses are experiencing as they deploy big data systems. Methods relying on global data schemas and traditional extract, transform and load (ETL) tools can't adequately cope with the scale of big data applications or the variety of the data types they involve, the long-time data management industry player asserted.

"In the past, everyone considered ETL the gold standard. You would extract, transform and load to a common global schema," said Stonebraker, an adjunct professor in MIT's computer science department as well as Tamr's CTO and co-founder, among other industry positions. But now, he added, a global data model is a fantasy for organizations.

In his estimation, the prime-quality global data schema was part of data warehousing's early rocky ride. "People thought they would build a global enterprise data model and everybody [would] use it," he said. "One hundred percent of them failed." Related ETL approaches proved to be labor-intensive, unmanageable and non-scalable, added Stonebraker, who received the Association for Computing Machinery's prestigious A.M. Turing Award for 2014.

As described by Stonebraker and others, data curation focuses on a streamlined process of discovering data sources of interest, cleaning and transforming the data, and semantically integrating it with other data sets before delivering a deduplicated composite result to data consumers within the organization.

Tamr, an MIT research spin-off based in Cambridge, Mass., offers what it calls a data unification platform for cataloging and curating large amounts of information. The company is joined in the new data integration scramble by rival vendors such as Paxata, Trifacta Inc. and WorkFusion.

DIFFERENT VIEWS FROM THE PODIUM

Other presenters and participants at the MIT event took a somewhat different view, saying that what Stonebraker called "the bondage of the schema" is already well on the wane.

John Talburt, chief scientist at Black Oak Analytics Inc. in Little Rock, Ark., agreed that a global data model is unachievable, and that many early data warehouses were failures. But he emphasized that few practitioners now pursuing data models aspire to be so completely encompassing.

"It's not a magic bullet; you don't have to integrate data universally," Talburt said. He added that data management professionals have learned that gradients of data quality can be employed, based on different factors. For example, a company's data of record would be treated differently than social media data used to assess customer sentiment.

For Joe Caserta, president at Caserta Concepts LLC, a data warehouse consulting and training company in New York, talk of a central, all-encompassing data model recalls arguments heard in the early days of data warehousing and data marts. Back then, the battle was engaged between a top-down data modeling theory championed by consultant Bill Inmon and others and bottom-up methods espoused by consultant Ralph Kimball and his kindred spirits. To simplify the debate somewhat, the former called for designers to build a monolithic data warehouse that would set the stage for data marts. The latter called for them to build dispersed data marts that set the stage for a central data warehouse.

Caserta co-authored The Data Warehouse ETL Toolkit with Kimball, so his preference for the bottom-up approach isn't surprising. "Today, we use model storming to very quickly figure out the business processes, the connections and the dimensions of data," Caserta said. "We think of it broadly, with a central hub in mind."

But the central data hub appears in iterations, and data modelers and enterprise architects understand it may never be fully achieved, he added. "You need to think about the entire enterprise, yes, but you need to plan large but build small," Caserta explained. And when you do that, you have to make sure you don't just create data silos, he noted, describing isolated data marts that may be replications of existing systems.

"There is a way to come up with something of a central approach, but in an iterative way, without waiting for this big, monolithic project to be finished," he said.

DATA: THE DAY ONE PROBLEM

"Technologies have evolved over the years, and ETL is one of them," said Murthy Mathiprakasam, principal product marketing manager for big data tools at data management vendor Informatica Corp. Reflecting on Stonebraker's remarks about data curation in an interview, he said that processes have become a lot more agile, and there is a lot more collaboration between IT and the business in many user organizations.

"It's inevitable that data is going to be all over the place these days," Mathiprakasam said. "It's unlikely that you can put everything in neat rows and columns. But you can still achieve centralized understanding of the data. You can have mechanisms to identify what data represents throughout the enterprise. The bottom line, though, is that the world of a heavily regulated monolithic schema no longer exists. That is not the world of 2015," he said.

A top data manager at The Bank of New York Mellon Corp. espoused a similar position at the MIT event. "You don't build warehouses that require you to model the world from Day One," said Rajendra Patil, head of data strategy at New York-based BNY Mellon, speaking during a conference session that considered data responsibilities in financial services companies.

Patil echoed the critique of the data warehouse and its long-running effects on the IT budgets of companies: "They've spent millions of dollars on data warehouses, and it has only worked to a certain extent."

Jack Vaughan

ABOUT THE AUTHORS

DAVID LOSHIN is the president of Knowledge Integrity Inc., a consulting, training and development company that works with clients on business intelligence, big data, data quality, data governance and master data management initiatives. Loshin writes for many industry publications and several TechTarget websites and is the author of numerous books. Visit his website or email him at loshin@knowledge-integrity.com.

JACK VAUGHAN oversees editorial coverage for SearchDataManagement. Previously, he was editor in chief for SearchSOA. Before joining TechTarget in 2004, he was editor at large at Application Development Trends and ADTmag.com. Email him at jvaughan@techtarget.com and follow him on Twitter: @JackVaughanatTT.

Data Preparation at Forefront of Effective Analytics is a SearchDataManagement.com e-publication.

Bridget Botelho, Editorial Director
Ron Karjian, Managing Editor
Moriah Sargent, Associate Managing Editor
Craig Stedman, Executive Editor
Linda Koury, Director of Online Design
Martha Moore, Senior Production Editor
Doug Olender, Publisher, dolender@techtarget.com
Annie Matthews, Director of Sales, amatthews@techtarget.com

TechTarget, 275 Grove Street, Newton, MA 02466
www.techtarget.com

© 2016 TechTarget Inc. No part of this publication may be transmitted or reproduced in any form or by any means without written permission from the publisher. TechTarget reprints are available through The YGS Group.

About TechTarget: TechTarget publishes media for information technology professionals. More than 100 focused websites enable quick access to a deep store of news, advice and analysis about the technologies, products and processes crucial to your job. Our live and virtual events give you direct access to independent expert commentary and advice. At IT Knowledge Exchange, our social community, you can get advice and share solutions with peers and experts.

COVER ART: FOTOLIA