Profit from Big Data flow. Delivering Big Data Success With the Signal Hub Platform


The Big Data Challenge

The business opportunities resulting from Big Data represent a disruptive force that, if properly harnessed, will change how companies operate and compete. While legacy data mining and reporting tools will continue to play a vital role in large enterprises, they were neither architected nor intended to capture the value of Big Data. A new type of technology is now needed: technology that's as fresh and innovative as the use cases it was designed to solve, use cases that didn't exist even five years ago.

The organizations grasping these opportunities are drawing new insights from underutilized data assets and then making this breadth of data accessible to broader audiences. And it's Opera Solutions' Signal Hub that's behind the scenes of these success stories. It has helped global companies tackle their most complex challenges with methods that are better, faster, and cheaper than applying legacy technology to present-day opportunities.

Legacy Solutions Are Insufficient

Big Data is an evolution of enterprise analytics and data management processes. It's a huge leap forward but still part of a natural evolution. To understand the role of new Big Data technologies, it's important to understand the ongoing role of legacy tools such as business intelligence (BI) tools and associated data warehouses.

BI tools are primarily used for two purposes. First, they help trained data analysts perform ad hoc analysis of historical data. Second, they are used to deliver predefined, standardized business reports and dashboards to executive- and management-level users. These dashboards, or reporting applications, include key performance indicators and charts focused on corporate operations, sales performance, supply chain, and other areas, while allowing end users to drill down into greater data granularity. For example, a sales VP may want to review national sales metrics and then tease the data apart by region, district, city, and individual store.

While the market for these BI technologies is well structured and stable, the marketplace for Big Data solutions is relatively undefined and immature. Most products touting Big Data capabilities, while valuable, are merely components of the overarching architecture needed to capture data from a growing number of sources and deliver new types of insights. Working with these new technologies is highly specialized technical work, but just incorporating these new skill sets into a Big Data infrastructure does not guarantee Big Data business results. Often these technology solutions merely shift the burden of delivering business value from IT and software engineers to data scientists. Data scientists represent a necessary yet rare commodity that is a prerequisite for Big Data success.

Signals: Big Noise to Small Data

Just because there's suddenly a glut of data doesn't make that data particularly interesting or valuable, at least in its raw form. Like crude oil full of impurities, sludge, and gunk, raw data put into the machines of commerce unrefined will just gum up the works. And yet, hidden in this unrefined, untamed flow of the world's data is more predictive information than has ever been available before. This is what we call Signals: the valuable patterns, connections, and correlations that, if properly extracted, allow us to predict behavior and outcomes far more accurately than we could in the past. Signals are the data elements, patterns, and calculations that have, through scientific experimentation, been proven valuable in predicting a particular outcome. And it's these Signals, not the Big Data where they're hidden, that hold the real value.

[Figure: example Signals, with labels including purchase patterns, payment patterns, payment behavior over time, credit line increase request, and alert]

Signals are increasingly important in a Big Data world, where data is a fast-flowing, ever-growing, heterogeneous, and exceedingly noisy input. Big Data's sheer size, as well as its other statistical properties, makes it difficult, if not impossible, to use as is. Transforming data into Signals is absolutely critical; we can rarely use untransformed, raw data successfully. High-quality Signals are necessary to distill the relationships among all of the entities surrounding a problem and across all of the attributes (including their time dimension) associated with these entities. In effect, Signals capture underlying drivers and patterns to create useful, accurate inputs that a machine can process with algorithms. Indeed, for most problems, high-quality Signals are as important in generating an accurate prediction as the underlying machine-learning algorithm that acts upon these Signals in creating the prescriptive action.

Signals are key ingredients to solving an array of problems, including classification, regression, clustering (segmentation), forecasting, collaborative filtering, and optimization. Signals are hierarchical. That is, within the Signal Hub, the Signal array might include simple Signals that can be used not only by themselves to predict behavior (e.g., customer behavior powering a recommendation) but can also be used as inputs into more sophisticated predictive models. These models, in turn, generate second-order, highly refined Signals. These Signals typically serve as inputs to business-process decision points. Signals can be both descriptive and predictive and can provide a multi-dimensional view of specific data types.
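
The hierarchy described above can be illustrated with a minimal Python sketch: two simple Signals are computed directly from customer data and then fed into a model that emits a second-order Signal. The function names, the weights, and the logistic form are assumptions made for illustration; they are not taken from the Signal Hub.

```python
import math

def recency_signal(days_since_last_purchase: int) -> float:
    """Simple Signal: 1.0 for a purchase today, decaying toward 0.0 with age."""
    return math.exp(-days_since_last_purchase / 30.0)

def frequency_signal(purchases_last_90_days: int) -> float:
    """Simple Signal: average purchases per week over the last 90 days."""
    return purchases_last_90_days / (90 / 7)

def propensity_signal(recency: float, frequency: float) -> float:
    """Second-order Signal: a logistic model over the two simple Signals.

    The weights are invented for illustration; a real model would be fit to data.
    """
    z = -2.0 + 3.0 * recency + 1.5 * frequency
    return 1.0 / (1.0 + math.exp(-z))

# A recent, frequent buyer versus a lapsed, infrequent one
active = propensity_signal(recency_signal(2), frequency_signal(12))
lapsed = propensity_signal(recency_signal(80), frequency_signal(1))
print(f"active={active:.2f} lapsed={lapsed:.2f}")
```

Each simple Signal remains usable on its own, while the propensity value is the kind of refined output that could feed a business-process decision point.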

Signals can be categorized into classes. Here are a few examples:

1. Sentiment: Captures the collective prevailing attitude about an entity, given a context. An entity can be a company, market, country, etc. Typically, sentiment Signals have discrete states, such as positive, neutral, or negative. (Example: "Current Sentiment on X Corporate Bonds is Positive.")
2. Behavior: Captures an underlying fundamental behavioral pattern for a given entity (e.g., a consumer) or a given dataset. These Signals are most often a time series and depend on the type of behavior being tracked and assessed. Examples of behavior Signals include aggregate money flow into ETFs, the number of 30-days-past-due occurrences in the last year for a credit card account, and propensity to buy a given product.
3. Event/Anomaly: Discrete in nature and used to trigger certain actions or alerts when a certain threshold condition is met. Examples include an ATM withdrawal that exceeds 3X the daily average or a bond rating downgrade by a rating agency.
4. Membership/Cluster: Designates where an entity belongs, given a dimension. For example, gaming establishments create clusters of their customers based on spend: high rollers, casual gamers, etc. Wealth management firms can create clusters of their customers based on monthly portfolio turnover, such as frequent traders, buy and hold, etc.
5. Correlation: Continuously measures the correlation of various entities and their attributes throughout a time series of values between 0 and 1. Examples include correlation of stock prices within a sector, unemployment and retail sales, interest rates and GDP, or home prices and interest rates.

Signals have attributes based on their representation in time or frequency domains. In a time domain, a Signal can be continuous-time or discrete-time. An output from a blood pressure monitor is an example of a continuous-time Signal; the daily market close values of the Dow Jones Index are an example of a discrete-time Signal.
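
Three of the classes above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the data layout, and the 25% turnover cutoff are hypothetical, while the 3X withdrawal rule and the 30-days-past-due count come from the examples in the text.

```python
from statistics import mean
from typing import Sequence

def event_signal(withdrawal: float, daily_history: Sequence[float],
                 multiple: float = 3.0) -> bool:
    """Event/Anomaly: fires when a withdrawal exceeds `multiple` x the daily average."""
    return withdrawal > multiple * mean(daily_history)

def behavior_signal(max_days_past_due_by_month: Sequence[int]) -> int:
    """Behavior: number of months in the last year with a 30+ day delinquency."""
    return sum(1 for days in max_days_past_due_by_month if days >= 30)

def membership_signal(monthly_turnover: float, cutoff: float = 0.25) -> str:
    """Membership/Cluster: the cutoff is a made-up value for illustration."""
    return "frequent trader" if monthly_turnover >= cutoff else "buy and hold"

print(event_signal(400.0, [100.0, 120.0, 80.0]))                   # average is 100, so 400 > 300
print(behavior_signal([0, 30, 45, 0, 0, 0, 0, 0, 0, 10, 0, 0]))    # two delinquent months
print(membership_signal(0.40))
```

Note that the event Signal is discrete (it either fires or it does not), the behavior Signal is a count that changes over time, and the membership Signal is a label.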
Within the frequency domain, Signals can be defined as high or low frequency. For example, the asset allocation trends of a brokerage account can be measured every 15 minutes, daily, and monthly. Depending on the frequency of measurement, a Signal derived from the underlying data can be fast-moving or slow-moving.

Figure 1: Advanced techniques used in Signal discovery
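
To make the point about measurement frequency concrete, the following Python sketch takes one synthetic series sampled every 15 minutes and averages it into hourly and daily buckets. The data and the `resample_mean` helper are invented for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def resample_mean(observations, bucket):
    """Average (timestamp, value) pairs within each bucket returned by `bucket(ts)`."""
    groups = defaultdict(list)
    for ts, value in observations:
        groups[bucket(ts)].append(value)
    return {key: sum(vals) / len(vals) for key, vals in sorted(groups.items())}

start = datetime(2024, 1, 1)
# Hypothetical equity allocation (%) sampled every 15 minutes over two days
observations = [(start + timedelta(minutes=15 * i), 60.0 + (i % 4)) for i in range(192)]

hourly = resample_mean(observations, lambda ts: ts.replace(minute=0))
daily = resample_mean(observations, lambda ts: ts.date())
print(len(observations), len(hourly), len(daily))
```

The 15-minute series moves with every observation, while the daily series changes only once per day; the same underlying data therefore yields a fast-moving or a slow-moving Signal depending on the bucket chosen.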

Identifying, extracting, and calculating Signals at scale from noisy Big Data requires a set of predefined Signal schemas and a variety of algorithms. A Signal schema is a specific type of template used to transform data into Signals. Different types of schemas may be used, depending on the nature of the data, the domain, and the business environment. Figure 1 details some of the techniques we use for initial Signal discovery.

Signal Hub Platform

Opera Solutions' Signal Hub integrates Big Data from both inside and outside the enterprise; provides the technology to identify, extract, and store Signals; and supports deployment of Big Data applications. It addresses Big Data challenges in a consistent and repeatable way, which greatly accelerates the delivery of business value. From a technical perspective, the easiest way to understand the architecture depicted in Figure 2 is to follow the full lifecycle of data as it is processed by the platform, organized into three major themes: batch processing, interactive processing, and analytic development.

Figure 2: Signal Hub Reference Architecture
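
The idea of a Signal schema as a reusable template can be sketched in Python as follows. The `SignalSchema` class, its fields, and the sample records are hypothetical; the source does not describe the actual schema format, so this only shows the general pattern of declaring a transformation once and applying it to data.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

@dataclass(frozen=True)
class SignalSchema:
    """Hypothetical template: which field to read, over what window, and how to aggregate."""
    name: str
    field: str
    window: int                                    # number of most recent records to use
    aggregate: Callable[[Sequence[float]], float]  # reduces the window to one value

def apply_schema(schema: SignalSchema, records: List[Dict[str, float]]) -> float:
    """Turn raw records into one Signal value according to the schema."""
    values = [record[schema.field] for record in records[-schema.window:]]
    return schema.aggregate(values)

records = [{"amount": 20.0}, {"amount": 35.0}, {"amount": 50.0}, {"amount": 95.0}]
schemas = [
    SignalSchema("avg_amount_last_3", "amount", 3, lambda v: sum(v) / len(v)),
    SignalSchema("max_amount_last_2", "amount", 2, max),
]
signals = {schema.name: apply_schema(schema, records) for schema in schemas}
print(signals)
```

Keeping the schema separate from the data means the same template can be applied to a different domain by swapping the field name, window, or aggregate.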

Batch Processing

Much of the heavy lifting in a Signal Hub is handled using batch processes. While users often ask about real-time or just-in-time processing, the reality is that many state-of-the-art algorithms are fed by enterprise systems that use batch processing in order to integrate with other batch systems. The Signal Hub receives batch data via SFTP or a landing zone directory and real-time data via HTTP or MQ. Ultimately, the work of the Signal Hub begins once data is made available either through file transfer or via an API. The components of batch processing include the following:

A. The base layer of the batch-processing stack is a workflow engine that is configured to execute all batch-processing work streams. It ensures that all new data is properly processed and all exceptions and alerts are handled. Processing is depicted in the reference architecture from bottom to top, with the workflow engine coordinating across this entire lifecycle. This processing can run at any required frequency and can be triggered by the arrival of data in the landing zone directory, according to a schedule, or by a defined event.
B. The data flow engine layer is the transformation workhorse of the batch system. Data flows are configured declaratively, specified in a specialized language and leveraging an internally maintained common library of data operators and connectors. The engine is responsible for executing specific data flows on specific data when initiated by the workflow engine. These processes also produce metadata used to feed downstream processes. This abstraction provides two important capabilities:
   a. Common elements of data processing are extracted into reusable operators and connectors, which allow the flow specification to be tailored for each Signal Hub.
   b. Flexible execution environments allow the data flow engine to operate against different data infrastructures and storage systems without requiring rewriting of the flow definitions or operators.
This allows Signal Hubs to grow from simple flat-file processing to scaled-out systems like Hadoop. It also allows us to push processing down to the underlying infrastructure, thus leveraging existing capabilities that might exist at a customer site.
C. The data management layer is the backbone of the system and is decoupled from the processing logic because we employ a variety of technologies. For example, Hadoop is used for the largest input data sets, where extensive transformation is required. We also leverage columnar and in-memory data stores for certain workloads. In some cases, even traditional relational database technology is sufficient. The data management layer is fed by the variety of connectors in the data flow engine layer and presents prepared data sets via uniform interfaces to the Signal processing logic. We leverage industry-standard interfaces such as JDBC, where applicable, and have developed our own abstractions for less standardized technologies such as column-family data stores. It is not uncommon to see a mix of such technologies at various stages of processing within a single Signal Hub.
D. Intelligent ETL handles all of the data quality management, mapping, linking, and structuring of the data that arrives. The intelligence comes from a deep understanding of the data sources, enabling monitoring for statistical deviations and a system for alerting when they occur. The results are clean data that are put into the data management layer for subsequent processing.
E. The SigGMS layer is responsible for calculating all of the Signals from Big Data and also managing Signal metadata. In batch mode, SigGMS is executed in a data flow, streaming data from the data management layer through the Signal code. In real-time mode, Signal services read and write data directly from the data management layer using the appropriate data management APIs for random access (e.g., JDBC, key/value store, column-family).
F. The batch analytics layer contains a variety of machine-learning and predictive modeling capabilities that we employ to service applications. These models consume Signals generated by SigGMS and similarly process data either in data flow streams (batch) or as callable services.
G. Batch data is either kept in the batch infrastructure for long-term use by the interactive layer or staged in a dedicated data store in the interactive layer. The distinction is made based on the requirements of the interactive layer. Often a great deal of data decimation occurs during the transfer. Also, the shape of the data storage may be changed at this point to optimize for ad hoc queries in the interactive layer.

Interactive Processing

Everything depicted above the batch layer collectively forms the interactive layer. Sometimes referred to as real-time, the interactive layer is distinguished by being invoked on demand rather than a priori. Actual quality-of-service requirements vary from milliseconds to several seconds, depending on integration requirements and the types of interactions that occur. For example, a credit card fraud decision must occur as part of a larger overall processing chain and must occur in milliseconds, while a request from an interactive website can take 100 ms before crossing the human perception threshold.
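
The split between work done a priori in batch and work done on demand can be sketched in Python as follows. This is a simplified illustration, not the Signal Hub implementation: a plain dictionary stands in for the staged interactive data store, and the function names, account identifiers, and 100 ms default budget are assumptions chosen to mirror the examples above.

```python
import time
from collections import defaultdict

def batch_compute_signals(transactions):
    """Batch step (a priori): average transaction amount per account."""
    totals, counts = defaultdict(float), defaultdict(int)
    for account, amount in transactions:
        totals[account] += amount
        counts[account] += 1
    return {account: totals[account] / counts[account] for account in totals}

def score_on_demand(store, account, amount, budget_ms=100.0):
    """Interactive step: look up the staged Signal and score one new transaction."""
    started = time.perf_counter()
    average = store.get(account)
    # None means no history was staged for this account
    ratio = amount / average if average else None
    elapsed_ms = (time.perf_counter() - started) * 1000.0
    return {"ratio_to_average": ratio, "within_budget": elapsed_ms <= budget_ms}

history = [("acct-1", 40.0), ("acct-1", 60.0), ("acct-2", 10.0)]
staged_store = batch_compute_signals(history)   # stands in for the interactive data store
result = score_on_demand(staged_store, "acct-1", 250.0)
print(result["ratio_to_average"])
```

Because the expensive aggregation has already happened in batch, the on-demand call reduces to a lookup and a division, which is what makes a tight response-time budget realistic.
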
Some interactive services, such as executing what-if simulations based on user input, may take tens of seconds to execute but are still returned interactively rather than queued for a nightly batch. Ultimately, the timing requirements drive how much work can be done and how much computing power is needed to do that work in the time allowed. The steps involved in interactive processing include the following:

A. The interactive layer's data access API abstracts any differences between storage technologies. This API is very similar to the batch storage API in that it leverages standards where possible but allows us to define extended APIs for nonstandard technologies.
B. The online scoring portion of the architecture is an optional set of services that are instantiated for use cases where models need to be invoked on demand. This is needed for cases where real-time information is needed to formulate a response, such as a fraud score for a credit card transaction.
C. The Signal Hub API is the interaction gateway to the Signal Hub. It is realized as a Java Application Server that includes the ability to handle all real-time transaction and event flows, real-time queries of Signals and model data, and live feedback from the external systems. By default, we provide these services through a RESTful API but can also provide connectors for SOAP-based integration or MQ-style integration.
D. Applications and data visualization live outside of the Signal Hub but consume Signal Hub content via the Signal Hub API. The creation of these connections is not typically part of a Signal Hub engagement, but they should be designed and created in a way that ensures proper integration and adoption.
E. Signal Monitor is an out-of-the-box application provided by Opera Solutions, which leverages information within the Signal Hub to help monitor Signals for exceptional situations and continued relevance of predictive power. Architecturally, the Signal Monitor lives outside of the Signal Hub, interacting through the same API available to any application, but it delivers important functionality that is generally useful in Signal Hub deployments.

Analytic Development

While the batch and interactive systems are an integration of many enabling technologies that form a Signal Hub, the task of defining precisely how a Signal Hub should operate in any given data environment cannot be removed from the equation through any amount of engineering or automation. It is the way in which this system supports the first-time and ongoing development of Signals that helps realize the promise of Big Data. Our data scientists can tap into the data flow at various points, as depicted in the reference architecture. In doing so, they accomplish the following:

A. Incorporate new data types by enhancing the metadata used to drive Intelligent ETL.
B. Discover and implement novel Signals based on new experiments or adapt existing Signals to new business domains.
C. Retrain models to capture nonstationary aspects of the problem domain, such as accounting for drift in input data.
D. Get feedback from the interactive layer about how Signals are being used to better inform the aforementioned continuous improvement.

The symbiotic nature of the services provided by the Signal Hub and the scientists who constantly mine vast and disparate data sources for value is the way in which Opera Solutions is able to offer a data science platform that continually adapts to real-world complexity and an evolving data landscape, all without placing unrealistic demands on our customers.
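
One way to detect the input drift mentioned above is the Population Stability Index (PSI), a general-purpose measure that compares how a Signal's values are distributed in a baseline period and in a recent period. The source does not say which method the Signal Monitor or the data scientists use, so the Python sketch below is an assumption; the bin edges and sample values are invented.

```python
import math

def population_stability_index(expected, actual, edges):
    """PSI between two samples of a Signal, using shared bin edges."""
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            # The bin index is the number of edges strictly below the value
            counts[sum(v > e for e in edges)] += 1
        # Floor each share at a small value so the logarithm is always defined
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

edges = [0.25, 0.5, 0.75]
baseline = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
shifted = [v + 0.4 for v in baseline]   # the same Signal after its inputs drifted upward

stable_psi = population_stability_index(baseline, list(baseline), edges)
drifted_psi = population_stability_index(baseline, shifted, edges)
print(stable_psi, drifted_psi > 0.25)
```

A PSI above roughly 0.25 is a commonly cited rule of thumb for a significant shift; in a setup like this, crossing such a threshold could be the trigger for retraining a model or revisiting a Signal.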

Jersey City, Boston, San Diego, London, Shanghai, New Delhi

ABOUT OPERA SOLUTIONS, LLC

Opera Solutions combines advanced science, technology, and domain knowledge to extract predictive intelligence from Big Data and turn it into insights and recommended actions that help people make smarter decisions, work more productively, serve their customers better, grow revenues, and reduce expenses. Its hosted solutions, delivered as a service, are today delivering results in some of the world's most respected organizations in financial services, healthcare, hospitality, telecommunications, and government. Opera Solutions is headquartered in Jersey City, NJ, with other offices in North America, Europe, and Asia. For more information, visit the website or call OPERA-22.