Product Innovation with Big Data
A resource for software product organizations and enterprise IT groups
Copyright 2015 Pentaho Corporation. Redistribution permitted. All trademarks are the property of their respective owners. For the latest information, please visit our web site at pentaho.com.
Introduction
Effective product managers are able to focus on the day-to-day fulfillment of user requirements and project deliverables while still keeping their eyes on the horizon, looking toward the market and technology trends that will help them create sustainable competitive advantages. The goal of this brief is to explain why Big Data represents one of those key trends, as well as how it can facilitate better outcomes for end users, the business, and other stakeholders. Technology organizations often lead the way in early adoption of disruptive systems, and we have already seen this happen with Big Data. Read on to understand the implications for application problem-solving potential, scalability, and intelligence.
Big Data Background
Historically, data that was high in volume, diverse in structure, and rapidly changing posed difficult challenges for organizations accustomed to working with traditional relational database technology. However, new technical paradigms such as schema-on-read, massively parallel processing, and MapReduce have provided ways to reduce the overhead required to get raw data into a data store and to drastically increase the speed and efficiency of processing large amounts of data. They have also made unstructured and semi-structured data much more accessible for businesses. These innovations have begun to unleash actionable analysis on a variety of previously challenging data sources, including web logs, documents and text, social media, mobile devices, and industrial sensors. Even dark data (data locked in corporate warehouses with little analytic access) has been given new life through these new technologies.
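To make the schema-on-read and MapReduce ideas concrete, here is a minimal sketch in plain Python (no Hadoop required; the log records and field names are invented for illustration). Structure is imposed only when each record is read, and counting is split into map and reduce phases the way a Hadoop job would split the work:

```python
import json
from collections import defaultdict

# Raw, semi-structured log records: no schema was enforced at write time.
raw_logs = [
    '{"user": "alice", "action": "view", "page": "/home"}',
    '{"user": "bob", "action": "click"}',   # missing "page" field
    'not-json garbage line',                # malformed record
    '{"user": "alice", "action": "view", "page": "/cart"}',
]

def map_phase(record):
    """Schema on read: parse each record at access time and emit (key, 1) pairs."""
    try:
        event = json.loads(record)
    except json.JSONDecodeError:
        return []                           # tolerate malformed input instead of failing the load
    return [(event["action"], 1)]

def reduce_phase(pairs):
    """Sum the counts per key, as a MapReduce reducer would."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

pairs = [kv for record in raw_logs for kv in map_phase(record)]
counts = reduce_phase(pairs)
print(counts)   # {'view': 2, 'click': 1}
```

Note how the garbage line and the record with a missing field never blocked ingestion; in a schema-on-write relational pipeline, both would have had to be resolved before loading.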
As open source Big Data technologies have matured into commercially supported products, several Big Data platform categories have started to gain rapid adoption:

Hadoop distributions: Frameworks for large-scale data storage and high-performance processing across a distributed file system, using the MapReduce paradigm as well as the more recently released MapReduce 2 (also known as YARN, a data operating system for cluster resource management); ideal for high-volume unstructured data. Hadoop vendors include Cloudera, MapR, and Hortonworks, among others.

NoSQL stores: Non-traditional databases with a flexible structure; often ideal for extremely rapid data ingestion and large numbers of reads based on key values. Rather than storing data in a relational or tabular structure, these stores may leverage structures such as documents, graphs, key-value pairs, or columns. Sample NoSQL store providers include MongoDB and DataStax (Cassandra database).

Analytic databases: Databases designed for high-performance analytics, leveraging techniques like compression, column-based storage, and high-speed bulk inserts of structured data; ideal for complex queries and OLAP analysis. Examples include HP Vertica and Amazon Redshift.

Taken together, these systems have enabled organizations to start harnessing data that is massive, fast moving, and diverse in structure, with powerful implications for both application and analytic capabilities.
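The key-value access pattern that NoSQL stores optimize for can be sketched with an in-memory stand-in (the keys, documents, and fields below are invented, not any particular product's API). Each document is fetched by key in one step, and documents in the same store need not share a schema:

```python
# Hypothetical "document store" keyed by user ID; a dict stands in for a
# real client library such as one for MongoDB or Cassandra.
documents = {
    "user:1001": {"name": "Ana", "cart": [{"sku": "A1", "qty": 2}]},
    "user:1002": {"name": "Ben", "loyalty_tier": "gold"},  # different fields: no fixed schema
}

def get(key):
    """Single key-based lookup: no joins, no tabular schema required."""
    return documents.get(key)

doc = get("user:1001")
print(doc["name"])   # Ana
```

This is why such stores excel at rapid ingestion and large numbers of key-based reads, and why they trade away the ad hoc join and aggregation strengths of relational and analytic databases.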
Changing the Game for Software Applications and Products
While the technology landscape is still evolving, teams in the software, web, and hardware sectors have led the way in delivering real value from Hadoop, NoSQL, analytic databases, and other emerging technologies. They have demonstrated that integrating these Big Data systems into existing, primarily relational architectures can create highly competitive product capabilities that deliver big benefits to software end users. A good way to understand this is to contrast innovative Big Data applications with more traditional applications supported largely by relational databases. Aaron Kimball, a committer on the Apache Hadoop project since 2007, indicates that the rigid structural requirements for storage and retrieval of data on relational platforms can limit traditional applications to solving narrowly defined problems. He suggests that Big Data applications can complement these traditional architectures, introducing a wider array of problem-solving possibilities thanks to flexible accommodation of different data structures and latency requirements.[1] An example of this idea is illustrated by Paytronix, a customer loyalty technology provider to restaurant chains. Initially, Paytronix customers had access to basic survey data on demographic characteristics of their patrons, such as age and whether or not they had children. However, this data was self-reported and not always accurate. Leveraging a Big Data architecture including Hadoop and Pentaho for multi-format data processing, Paytronix was able to correct the missing and inaccurate demographic information by modeling and blending customer social media profile data and on-site entree ordering trends. Innovation with this data has allowed Paytronix to go beyond established loyalty program services to provide intelligent marketing and segmentation recommendations to customers that can boost the value generated by those programs.
From a strategic point of view, the ability to support novel sources of rich information in near real-time opens up possibilities for new products targeted at new use cases and, potentially, new markets. At Pentaho, we have seen that ingesting and blending a wider variety of data sources into Big Data systems can provide a more complete picture of customers across different industries, which ultimately can lead to better business decisions. These analytics can also be automated behind the scenes with data pre-processing and predictive algorithms in order to deliver improved application experiences and operationalize insights as part of the product workflow. In other words, Big Data architectures can fuel comprehensive data-driven applications, not just analytics on large amounts of data. According to recent research, a new design approach is leading to apps that leverage Big Data predictive analytics to anticipate and provide the right functionality and content, on the right device, at the right time, for the right person, by continuously learning about them.[2] Machine learning and predictive analytics, when applied to multi-source blended data at scale, allow applications to become more intelligent and responsive to end user needs in a timely fashion. The same type of Big Data architecture can also bring to life smarter devices and equipment in B2B sectors, like heavy industry and networking. The end result is automated and intelligent products driven by prescriptive analytics. Big Data can also help technology teams deliver a greater degree of scalability, in terms of data volumes processed, user loads, and responsiveness of applications to real-time or near real-time requests. Hadoop, for instance, can accelerate data processing and reduce storage costs by an order of magnitude relative to traditional approaches.
Meanwhile, NoSQL frameworks are often able to fuel faster, more efficient application performance on hot, or more urgently required, data sets that sit closer to the presentation layer for end users. In general, many of these technologies have evolved from projects that originally started inside the walls of some of the largest and most successful consumer technology firms, which needed to support user bases that were growing into the hundreds of millions and beyond. This type of scalability is only now becoming accessible to technology firms of all sizes and sub-sectors.

[1] Aaron Kimball, "The secrets of designing and building big data apps," venturebeat.com, 12/24/2013.
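As a toy illustration of operationalizing analytics inside an application workflow, the sketch below scores blended customer records with a hand-set linear model. The features, weights, and threshold are all invented, standing in for an algorithm actually trained at scale on a Big Data back-end:

```python
# Blended customer records drawn from several hypothetical sources
# (visit history, social media, transactions); field names are illustrative.
customers = [
    {"id": 1, "visits_30d": 12, "social_mentions": 3, "avg_ticket": 24.0},
    {"id": 2, "visits_30d": 1,  "social_mentions": 0, "avg_ticket": 55.0},
]

# Hand-set weights stand in for coefficients a trained model would supply.
WEIGHTS = {"visits_30d": 0.5, "social_mentions": 1.2, "avg_ticket": 0.05}

def engagement_score(record):
    """Linear score over blended features."""
    return sum(WEIGHTS[feature] * record[feature] for feature in WEIGHTS)

def next_action(record, threshold=5.0):
    """Operationalize the score: a low score triggers a retention offer."""
    return "target_promotion" if engagement_score(record) < threshold else "standard"

for c in customers:
    print(c["id"], round(engagement_score(c), 2), next_action(c))
```

The point is the shape of the loop, not the model: scoring runs automatically behind the scenes, and the application acts on the result without a human analyst in the path.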
Blueprints for Next-Generation Applications
While there is not one right way to leverage Big Data to create a new end user application or enhance an existing one, Pentaho has observed a few patterns based on customer and market experience. This diagram is not meant to illustrate a complete solution architecture, but rather a blueprint for the different data components seen in emerging applications.

[Figure: Big Data Application Patterns. Weblog and social media data plus machine, sensor, and device data flow into a Hadoop cluster (fast processing on many data formats, affordable historical storage, training of machine learning algorithms) and a NoSQL store (flexible and fast read/write access, near-line speed layer for performance, operational store). Structured and relational data, customer profile data, and existing application data reside in relational databases, which may integrate with other enterprise systems and facilitate fast analytic queries for end users, often in a data refinery pattern with Hadoop. A client-side user interface delivers a web-based experience including embedded visual analytics.]

Often we see a two-tiered architecture, where Hadoop serves as a massively scalable back-end archive and training ground for machine learning and predictive analytics algorithms. It also ingests the previously challenging semi-structured and unstructured data that were not a fit for traditional relational database technology. Closer to the user, a NoSQL database often serves as an operational store, holding less data than Hadoop but designed to accelerate application performance and address near real-time data needs. These components together support core application functionality, while an analytic database meets needs for high-performance, low-latency ad hoc analysis, visualization, and reporting by end users. The visual analytics are often embedded in the user interface as a seamless part of the end user experience with that application.
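The two-tiered read path described above can be sketched as a simple router: queries for recent data hit the fast operational store, while older ranges fall back to the Hadoop archive. The dicts, dates, and window below are invented stand-ins for real store clients:

```python
from datetime import date, timedelta

# Stand-ins for the two tiers; keys are the dates the records describe.
operational_store = {date(2015, 3, 10): "hot-record"}                 # NoSQL speed layer
hadoop_archive = {date(2015, 3, 10): "hot-record",
                  date(2014, 1, 5): "cold-record"}                    # full history

def read(key, today=date(2015, 3, 12), hot_window=timedelta(days=7)):
    """Route by recency: serve from the speed layer when the data is hot."""
    if today - key <= hot_window and key in operational_store:
        return operational_store[key], "operational"
    return hadoop_archive.get(key), "archive"

print(read(date(2015, 3, 10)))   # served from the operational store
print(read(date(2014, 1, 5)))    # falls back to the Hadoop archive
```

In a real deployment the integration layer keeps the hot window of data replicated into the operational store, so the application gets near real-time responsiveness without giving up the full archive.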
Overall, the different data stores and frameworks are linked via a data integration and orchestration layer, which may include Pentaho. This layer both streamlines the delivery of data through the application architecture and facilitates the use of predictive algorithms in an automated process.

[2] Mike Gualtieri, Forrester Research, "Predictive Apps Are the Next Big Thing in Customer Engagement," 6/25/2013.
Real-Life Examples
As indicated above, these design patterns are based on real-life examples from Pentaho's customer base. Interestingly, many of these examples fall into one of two categories: 1) intelligent CRM, marketing, and e-commerce products, and 2) Internet of Things (IoT) products that leverage sensor, equipment, or device data. We'll discuss an example of each below.

RICHRELEVANCE: Next-Generation Data Platform for Retail Personalization
RichRelevance provides a platform that delivers personalization services for Fortune 500 retailers, allowing them to deliver the most relevant content to their customers across online and in-store engagement channels. The platform delivers over 50 million personalized shopping sessions a day with sub-second response times. This performance is only possible thanks to the company's early investment in an intelligent Big Data application architecture. At its core, the RichRelevance platform leverages Hadoop, HBase (a NoSQL database), and Hive (a relational layer on top of Hadoop), as well as Pentaho, though the company is always incorporating new frameworks. These systems enable the rapid ingestion and processing of massive amounts of web session information, like pageviews and purchases, as well as rapidly changing product catalog information. On this architecture, RichRelevance runs a variety of regularly updated predictive algorithms based on web visitor behavior, product information, and merchandising objectives in order to determine the best content to serve. These recommendations can be optimized to maximize margin and revenue against such constraints as inventory stocks and legal restrictions. RichRelevance has not only streamlined 1.6 petabytes of diverse data, but has also embedded Pentaho analytics into its customer-facing application to provide insight into the performance of these omni-channel personalization services.
Overall, Big Data has enabled RichRelevance to create a unique offering to retailers that serves highly personalized content to each individual shopper in order to boost conversion at scale.

[Figure: RichRelevance Big Data Architecture. Server log data, website tracking, customer demographics, online transactions, and customer ERP and supply data flow through PDI into data marts and the Business Analytics Server. A data scientist performs data mining and machine learning refinement; business users (e.g., the CFO) leverage real-time reporting; end users get Agile BI and self-service capabilities.]
RUCKUS WIRELESS: Delivering Differentiated Networking Products with Big Data
Ruckus Wireless is a high-performance wireless infrastructure provider, catering to both telecommunications carriers and enterprises. Recently, the company sought to launch a flexible analytics product to provide its clients with detailed visibility into network traffic, capacity, and performance. In order to deliver the best possible product, Ruckus adopted a Big Data architecture that could scale the solution to a decade of analysis on millions of user sessions and hundreds of thousands of wireless access points per carrier. To meet these needs, Ruckus leverages Pentaho to ingest complex JSON and XML files from the Wi-Fi equipment into a Hadoop cluster, later pulling data into HP Vertica (an analytic database) for high-performance Wi-Fi network analytics. Further, Ruckus chose to partner with Pentaho to OEM an analytics solution for drag-and-drop reporting, ad hoc analysis, and visualization. The new Ruckus analytics offering enables customers to quickly uncover trends in the health and performance of their networks, at a scale of data only possible with a Big Data back-end. Importantly, they've been able to launch the application as a new revenue-generating product, which complements their hardware-focused core business.

[Figure: Ruckus Wireless Big Data Architecture. Unstructured Wi-Fi data, machine and network data, and account and ERP data flow through PDI into the Business Analytics Server. A data scientist performs data mining and machine learning refinement; business users (e.g., the CFO) leverage real-time reporting; end users get Agile BI and self-service capabilities.]
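The ingestion step in this kind of pipeline often amounts to flattening nested device records into tabular rows suitable for bulk load into an analytic database. The sketch below shows that transformation for a hypothetical JSON payload; the field names are illustrative, not Ruckus's actual schema:

```python
import json

# A nested record as a piece of Wi-Fi equipment might emit it (invented fields).
raw = '''{
  "access_point": "ap-042",
  "sessions": [
    {"client": "aa:bb:cc", "bytes": 1024, "rssi": -61},
    {"client": "dd:ee:ff", "bytes": 2048, "rssi": -70}
  ]
}'''

def flatten(record_json):
    """Emit one flat row per session, denormalizing the parent AP id into each."""
    record = json.loads(record_json)
    return [
        {"ap": record["access_point"], **session}
        for session in record["sessions"]
    ]

rows = flatten(raw)
print(rows[0])   # {'ap': 'ap-042', 'client': 'aa:bb:cc', 'bytes': 1024, 'rssi': -61}
```

Once flattened, rows like these can be bulk-inserted into a columnar analytic database, where compression and column storage make the resulting network queries fast.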
Conclusion
The Big Data market is still in its early innings, but we are already seeing pioneering tech teams and product companies leverage Hadoop, NoSQL, and other emerging systems to deliver intelligent, data-driven applications that delight users in novel and valuable ways. Recent changes in the technology landscape have made it possible to build capabilities into applications that were once only dreamed of: think intelligent recommendations served on demand to millions of users, or automated, granular analytics on sensors across networking equipment, jet engines, and maritime vessels. These use cases are no longer restricted to firms like Facebook, Netflix, or General Electric; they are now much more broadly accessible.
Learn more about Pentaho Business Analytics: pentaho.com/contact | +1 (866) 660-7555
Global Headquarters: Citadel International - Suite 340, 5950 Hazeltine National Drive, Orlando, FL 32822, USA; tel +1 407 812 6736; fax +1 407 517 4575
US & Worldwide Sales Office: 353 Sacramento Street, Suite 1500, San Francisco, CA 94111, USA; tel +1 415 525 5540; toll free +1 866 660 7555
United Kingdom, Rest of Europe, Middle East, Africa: London, United Kingdom; tel +44 (0) 20 3574 4790; toll free (UK) 0 800 680 0693
France: Paris, France; tel +33 97 51 82 296; toll free (France) 0800 915343
Germany, Austria, Switzerland: Munich, Germany; tel +49 (0) 322 2109 4279; toll free (Germany) 0800 186 0332
Belgium, Netherlands, Luxembourg: Antwerp, Belgium; tel (Netherlands) +31 8 58 880 585; toll free (Belgium) 0800 773 83
Italy, Spain, Portugal: Valencia, Spain; toll free (Italy) 800 798 217; toll free (Portugal) 800 180 060