Infosys Labs Briefings
VOL 11 NO 1 (2013)
BIG DATA: CHALLENGES AND OPPORTUNITIES
Big Data: Countering Tomorrow's Challenges

Infosys Labs Briefings Advisory Board: Anindya Sircar PhD, Associate Vice President & Head - IP Cell; Gaurav Rastogi, Vice President, Head - Learning Services; Kochikar V P PhD, Associate Vice President, Education & Research Unit; Raj Joshi, Managing Director, Infosys Consulting Inc.; Ranganath M, Vice President & Chief Risk Officer; Simon Towers PhD, Associate Vice President and Head - Center of Innovation for Tomorrow's Enterprise, Infosys Labs; Subu Goparaju, Senior Vice President & Head - Infosys Labs

Big data was the watchword of the year. Even before one could understand what it really meant, it began getting tossed about in huge doses in almost every other analyst report. Today, the World Wide Web hosts upwards of 800 million webpages, each trying either to educate or to build a perspective on the concept of Big data. Technology enthusiasts believe that Big data is the next big thing after cloud, and it is lately being adopted across industries with great fervor. In this issue we explore what the Big data revolution is and how it will likely help enterprises reinvent themselves.

As citizens of this digital world, we generate more than 200 exabytes of information each year, equivalent to 20 million Libraries of Congress. According to Intel, each internet minute sees 100,000 tweets, 277,000 Facebook logins, 204 million email exchanges, and more than 2 million search queries fired. At this scale of data churn it is beyond any human's capability to process the data, and hence there is a need for machine processing of information.

There is no dearth of data for today's enterprises. On the contrary, they are mired in data, and quite deeply at that. The focus today is therefore on discovery, integration, exploitation and analysis of this overwhelming information. Big data may be construed as the technological intervention to undertake this challenge. Because Big data systems are expected to help analyze structured and unstructured data, they are drawing huge investments. Analysts have estimated that enterprises will spend more than US$120 billion by 2015 on analysis systems. The success of Big data technologies depends upon natural language processing capabilities, statistical analytics, large-scale storage and search technologies. Big data analytics can help cope with large data volumes, data velocity and data variety, and enterprises have started leveraging these systems to mine hidden insights from data.

In the first issue of 2013, we bring you papers that discuss how Big data analytics can make a significant impact on several industry verticals, like medical, retail and IT, and how enterprises can harness the value of Big data. As always, do let us know your feedback about the issue.

Happy Reading,
Yogesh Dandawate
Deputy Editor
[email protected]
Index

Opinion: Metadata Management in Big Data
By Gautham Vemuganti
Any enterprise that is in the process of or considering Big data applications deployment has to address the metadata management problem. The author proposes a metadata management framework to realize Big data analytics.

Trend: Optimization Model for Improving Supply Chain Visibility
By Saravanan Balaraj
The paper explores the challenges that dot Big data adoption in the supply chain and proposes a value model for Big data optimization.

Discussion: Retail Industry Moving to Feedback Economy
By Prasanna Rajaraman and Perumal Babu
Big data analysis of customers' preferences can help retailers gain a significant competitive advantage, suggest the authors.

Perspective: Harness Big Data Value and Empower Customer Experience Transformation
By Zhong Li PhD
Always-on digital customers continuously create more data of various types. Enterprises are analyzing this heterogeneous data to understand customer behavior, spend and social media patterns.

Framework: Liquidity Risk Management and Big Data: A New Challenge for Banks
By Abhishek Kumar Sinha
Managing liquidity risk on simple spreadsheets can lead to non-real-time and inappropriate information that may not be enough for efficient liquidity risk management (LRM). The author proposes an iterative framework for effective liquidity risk management.

Model: Big Data Medical Engine in the Cloud (BDMEiC): Your New Health Doctor
By Anil Radhakrishnan and Kiran Kalmadi
In this paper the authors describe how Big data analytics can play a significant role in the early detection and diagnosis of fatal diseases, reduction in health care costs and improvement in the quality of health care administration.

Approach: Big Data Powered Extreme Content Hub
By Sudeeshchandran Narayanan and Ajay Sadhu
With the arrival of Big Content, the need to extract, enrich, organize and manage semi-structured and unstructured content and media is increasing. This paper talks about the need for an Extreme Content Hub to tame the Big data explosion.

Insight: Complex Events Processing: Unburdening Big Data Complexities
By Bill Peer, Prakash Rajbhoj and Narayanan Chathanur
Complex Event Processing along with in-memory data grid technologies can help in pattern detection, matching, analysis, processing and split-second decision making in Big data scenarios, opine the authors.

Practitioners' Perspective: Big Data: Testing Approach to Overcome Quality Challenges
By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja
This paper suggests the need for a robust testing approach to validate Big data systems and identify possible defects early in the implementation life cycle.

Research: Nature Inspired Visualization of Unstructured Big Data
By Aaditya Prakash
Classical visualization methods are falling short in accurately representing the multidimensional and ever-growing Big data. Taking inspiration from nature, the author proposes a nature-inspired spider cobweb technique for visualization of Big data.
"A robust testing approach needs to be defined for validating structured and unstructured data to identify possible defects early in the implementation life cycle."
Naju D. Mohan
Delivery Manager, RCL Business Unit, Infosys Ltd.

"Big data augmented with Complex Event Processing capabilities can provide solutions that utilize in-memory data grids for analyzing trends, patterns and events in real time."
Bill Peer
Principal Technology Architect, Infosys Labs, Infosys Ltd.
Metadata Management in Big Data
By Gautham Vemuganti

Big data analytics must reckon with the importance and criticality of metadata.

Big data, true to its name, deals with large volumes of data characterized by volume, variety and velocity. Any enterprise that is in the process of or considering a Big data applications deployment has to address the metadata management problem.

Traditionally, much of the data that business users use is structured. This, however, is changing with the exponential growth of data, or Big data. Metadata defining this data is spread across the enterprise in spreadsheets, databases, applications and even in people's minds (the so-called tribal knowledge). Most enterprises do not have a formal metadata management process in place because of the misconception that it is an Information Technology (IT) imperative and does not have an impact on the business. However, the converse is true: it has been proven that a robust metadata management process is not only necessary but required for successful information management.

Big data introduces large volumes of unstructured data for analysis. This data could be in the form of a text file or any multimedia file (e.g., audio, video). To bring this data into the fold of an information management solution, its metadata should be correctly defined. Metadata management solutions provided by various vendors usually have a narrow focus: an ETL vendor will capture metadata for the ETL process, and a BI vendor will provide metadata management capabilities for their BI solution. The silo-ed nature of metadata does not give business users an opportunity to have a say and actively engage in metadata management. A good metadata management solution must provide visibility across multiple solutions and bring business users into the fold for a collaborative, active metadata management process.

METADATA MANAGEMENT CHALLENGES

Metadata, simply defined, is data about data. In the context of analytics, some common examples of metadata are report definitions, table definitions, the meaning of a particular master data entity (sold-to customer, for example), ETL mappings, and formulas and computations. The importance of metadata cannot be overstated. Metadata drives the accuracy of reports, validates data transformations, ensures accuracy of calculations and enforces consistent definition of business terms across multiple business users.
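To make "data about data" concrete, here is a minimal Python sketch (our illustration, not from the paper; the entity, field names and rules are hypothetical) in which a metadata definition both documents a master data entity and validates incoming records against it, the kind of check that keeps reports and calculations consistent:

# Minimal sketch: metadata as "data about data" driving validation.
# The entity, fields and rules below are illustrative only.

metadata = {
    "entity": "sold_to_customer",
    "fields": {
        "customer_id": {"type": int, "required": True},
        "region":      {"type": str, "required": True,
                        "allowed": {"NA", "EMEA", "APAC"}},
        "net_revenue": {"type": float, "required": True, "unit": "USD"},
    },
    # A shared formula definition helps prevent divergent calculations
    # across reports and business users.
    "computations": {
        "revenue_per_customer": "sum(net_revenue) / count(customer_id)",
    },
}

def validate(record: dict, meta: dict) -> list:
    """Return a list of metadata violations for one record."""
    errors = []
    for name, rules in meta["fields"].items():
        if name not in record:
            if rules.get("required"):
                errors.append(f"missing field: {name}")
            continue
        value = record[name]
        if not isinstance(value, rules["type"]):
            errors.append(f"{name}: expected {rules['type'].__name__}")
        if "allowed" in rules and value not in rules["allowed"]:
            errors.append(f"{name}: {value!r} not allowed")
    return errors

records = [
    {"customer_id": 101, "region": "EMEA", "net_revenue": 1250.0},
    {"customer_id": 102, "region": "LATAM", "net_revenue": "n/a"},
]
for r in records:
    print(r["customer_id"], validate(r, metadata))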
Figure 1: Data Governance Shift with Big Data Analytics, contrasting a single monolithic governance process with multiple governance processes (people, rules, metrics, process). Source: Infosys Research

In a typical large enterprise which has grown by mergers, acquisitions and divestitures, metadata is scattered across the enterprise in various forms, as noted in the introduction. In large enterprises there is wide acknowledgement that metadata management is critical, but most of the time there is no enterprise-level sponsorship of a metadata management initiative. Even if there is, it is focused on one specific project sponsored by one specific business. The impact of good metadata management practices is not consistently understood across the various levels of the enterprise. Conversely, the impact of poorly managed metadata comes to light only after the fact, i.e., after a certain transformation happens, a report or a calculation is run, or two divisional data sources are merged. Metadata is typically viewed as the exclusive responsibility of the IT organization, with business having little or no input or say in its management. The primary reason is that there are multiple layers of organization between IT and business, which introduces communication barriers between the two. Finally, metadata is not viewed as a very exciting area of opportunity; it is only addressed as an after-thought.

DIFFERENCES BETWEEN TRADITIONAL AND BIG DATA ANALYTICS

In traditional analytics implementations, data is typically stored in a data warehouse. The data warehouse is modeled using one of several techniques, is developed over time and is a constantly evolving entity.
Analytics applications developed using the data in a data warehouse are also long-lived. Data governance in traditional analytics is a centralized process, and metadata is managed as part of the data governance process. In traditional analytics, data is discovered, collected, governed, stored and distributed.

Big data introduces large volumes of unstructured data. This data is highly dynamic and therefore needs to be ingested quickly for analysis. Big data analytics applications, however, are characterized by short-lived, quick implementations focused on solving a specific business problem. The emphasis of Big data analytics applications is more on experimentation and speed as opposed to long-drawn-out modeling exercises. The need to experiment and derive insights quickly using data changes the way data is governed. In traditional analytics there is usually one central governance team focused on governing the way data is used and distributed in the enterprise. In Big data analytics, there are multiple governance processes in play simultaneously, each geared towards answering a specific business question. Figure 1 illustrates this.

Most of the metadata management challenges we referred to in the previous section alluded to typical enterprise data that is highly structured. To analyze unstructured data, additional metadata definitions are necessary. To illustrate the need to enhance metadata to support Big data analytics, consider sentiment analysis using social media conversations as an example. Say someone posts a message on Facebook: "I do not like my cell-phone reception. My wireless carrier promised wide cell coverage but it is spotty at best. I think I will switch carriers." To infer the intent of this customer, the inference engine has to rely on metadata as well as the supporting domain ontology. The metadata will define Wireless Carrier, Customer, Sentiment and Intent. The inference engine will leverage the ontology dependent on this metadata to infer that this customer wants to switch cell phone carriers.

Big data is not just restricted to text. It could also contain images, videos and voice files. Understanding, categorizing and creating metadata to analyze this kind of non-traditional content is critical. It is evident that Big data introduces additional challenges in metadata management, and it is clear that there is a need for a robust metadata management process which governs metadata with the same rigor as data if enterprises are to be successful with Big data analytics. To summarize, a metadata management process specific to Big data should incorporate the context and intent of data, support non-traditional sources of data and be robust enough to handle the velocity of Big data.

ILLUSTRATIVE EXAMPLE

Consider an existing master data management system in a large enterprise. This master data system has been developed over time and has specific master data entities like product, customer, vendor and employee. The master data system is tightly governed, and data is processed (cleansed, enriched and augmented) before it is loaded into the master data repository. This specific enterprise is considering bringing in social media data for enhanced customer analytics. The social media data is to be sourced from multiple sources and incorporated into the master data management system. As noted earlier, social media conversations have context, intent and sentiment.
The context refers to the situation in which a customer was mentioned, the intent refers to the action that an individual is likely to take, and the sentiment refers to the state of being of the individual. For example, suppose an individual sends a tweet or starts a Facebook conversation about a retailer from a football game. The context would then be a sports venue. If the tweet or conversation consisted of positive comments about the retailer, then the sentiment would be determined as positive. If the update consisted of highlighting a promotion by the retailer, then the intent would be to collaborate or share with the individual's network.

If such social media updates have to be incorporated into any solution within the enterprise, then the master data management solution has to be enhanced with metadata about Context, Sentiment and Intent. Static lookup information will need to be generated and stored so that an inference engine can leverage this information to provide inputs for analysis. This will also necessitate a change in the back-end: the ETL processes that are responsible for this master data will now have to incorporate the social media data as well. Furthermore, the customer information extracted from these feeds needs to be standardized before being loaded into any transaction system.

FRAMEWORK FOR METADATA MANAGEMENT IN BIG DATA ANALYTICS

We propose that metadata be managed using the five components shown in Figure 2.

Figure 2: Metadata Management Framework for Big Data Analytics (metadata discovery, collection, governance, storage and distribution). Source: Infosys Research

Metadata Discovery
Discovering metadata is critical in Big data for the reasons of context and intent noted in the prior section. Social data is typically sourced from multiple sources, and all these sources will have different formats. Once metadata for a certain entity is discovered for one source, it needs to be harmonized across all sources of interest. This process for Big data will need to be formalized using metadata governance.

Metadata Collection
A metadata collection mechanism should be implemented. A robust collection mechanism should aim to minimize or eliminate metadata silos. Once again, a technology or a process for metadata collection should be implemented.

Metadata Governance
Metadata creation and maintenance needs to be governed. Governance should include resources from both the business and IT teams. A collaborative framework between business and IT should be established to provide this governance. Appropriate processes (manual or technical) should be utilized for this purpose. For example, on-boarding a new Big data source should be a collaborative effort between business users and IT: IT will provide the technology to enable business users to discover metadata.
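As an illustration of the discovery and collection components, the following minimal Python sketch (ours, not part of the proposed framework's specification; the source feeds, field names and canonical mappings are invented) discovers crude metadata from sample records of two social media sources and harmonizes it into a single registry that a governance process could then review:

# Minimal sketch of metadata discovery and harmonization across sources.
# Source formats, field names and the canonical mapping are hypothetical.

from collections import defaultdict

def discover_metadata(sample_record: dict) -> dict:
    """Infer a crude metadata description (field -> type name) from a sample record."""
    return {field: type(value).__name__ for field, value in sample_record.items()}

# Mapping of source-specific field names to canonical metadata terms;
# in practice this is agreed between business and IT during on-boarding.
CANONICAL = {
    "msg": "conversation_text", "text": "conversation_text",
    "user": "customer_handle", "screen_name": "customer_handle",
    "ts": "event_time", "created_at": "event_time",
}

def harmonize(discovered: dict) -> dict:
    """Map source field names onto canonical names, keeping the inferred types."""
    return {CANONICAL.get(field, field): ftype for field, ftype in discovered.items()}

registry = defaultdict(dict)   # canonical field -> {source: inferred type}

samples = {
    "facebook_feed": {"msg": "I will switch carriers", "user": "jdoe", "ts": 1672531200},
    "twitter_feed":  {"text": "spotty coverage again", "screen_name": "@jdoe",
                      "created_at": 1672531300},
}

for source, record in samples.items():
    for field, ftype in harmonize(discover_metadata(record)).items():
        registry[field][source] = ftype

for field, per_source in registry.items():
    print(field, per_source)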
Metadata Storage
Multiple models for enterprise metadata storage exist; the Common Warehouse Meta-model (CWM) is one example. A similar model, or an extension thereof, can be utilized for this purpose. If no such model fits the requirements of an enterprise, then suitable custom models can be developed.

Metadata Distribution
This is the final component. Metadata, once stored, will need to be distributed to consuming applications. A formal distribution model should be put in place to enable this distribution. For example, some applications can integrate directly with the metadata storage layer, while others will need specialized interfaces to be able to leverage this metadata.

Figure 3: Equal Importance of Metadata & Data Processing for Big Data Analytics, showing the metadata discovery, collection, governance, storage and distribution components alongside their data-processing counterparts. Source: Infosys Research

We note that in traditional analytics implementations a framework similar to the one we propose exists, but for data. The metadata management framework should be implemented alongside a data management framework to realize Big data analytics.

THE PARADIGM SHIFT

The discussion in this paper brings to light the importance of metadata and the impact it has not only on Big data analytics but on traditional analytics as well. We are of the opinion that if enterprises want to get value out of their data assets and leverage the Big data tidal wave, then the time is right to shift the paradigm from data governance to metadata governance and make data management part of the metadata governance process.

A framework is only as good as how it is viewed and implemented within the enterprise. The metadata management framework is successful if there is sponsorship for this effort from the highest levels of management.
This includes both business and IT leadership within the enterprise. The framework can be viewed as being very generic. Change is a constant in any enterprise, and the framework can be made flexible to adapt to the changing needs and requirements of the business. All the participants and personas engaged in the data management function within an enterprise should participate in the process. This will promote and foster collaboration between business and IT. It should be made sustainable and followed diligently by all the participants, so that the framework is used to onboard not only new data sources but also new participants in the process.

Metadata and its management is an oft-ignored area in enterprises, with multiple consequences. The absence of robust metadata management processes leads to erroneous results, project delays and multiple interpretations of business data entities. These are all avoidable with a good metadata management framework. The consequences affect the entire enterprise either directly or indirectly. From the lowest-level employee to the senior-most executive, incorrect or poorly managed metadata not only affects operations but also directly impacts the top-line growth and bottom-line profitability of an enterprise. Big data is viewed as the most important innovation that brings tremendous value to enterprises. Without a proper metadata management framework, this value might not be realized.

CONCLUSION

Big data has created quite a bit of buzz in the marketplace. Pioneers like Yahoo and Google created the foundations of what is today called Hadoop. There are multiple players in the Big data market today, developing everything from technology to manage Big data, to applications needed to analyze Big data, to companies engaged in Big data analysis and selling that content. In the midst of all the innovation in the Big data space, metadata is often forgotten. It is important to recognize and realize the importance of metadata management and the critical impact it has on enterprises. If enterprises wish to remain competitive, they have to embark on Big data analytics initiatives, and in this journey they cannot afford to ignore the metadata management problem.

REFERENCES
1. Davenport, T., and Harris, J. (2007), Competing on Analytics: The New Science of Winning, Harvard Business School Press.
2. Jennings, M., What role does metadata management play in enterprise information management (EIM)? Available at searchbusinessanalytics.techtarget.com/answer/the-importance-of-metadata-management-in-eim.
3. Metadata Management Foundation Capabilities Component (2011). mike2.openmethodology.org/wiki/Metadata_Management_Foundation_Capabilities_Component.
4. Rogers, D. (2010), Database Management: Metadata is more important than you think. Available at com/sqletc/article.php/ /Database-Management-Metadata-is-more-important-than-you-think.htm.
5. Data Governance Institute (2012), The DGI Data Governance Framework. Available at com/fw_the_dgi_data_governance_framework.html.
Optimization Model for Improving Supply Chain Visibility
By Saravanan Balaraj

Enterprises need to adopt different Big data analytic tools and technologies to improve their supply chains.

In today's competitive lead-or-leave marketplace, Big data is seen as a paradox that offers both challenge and opportunity. Effective and efficient strategies to acquire, manage and analyze data lead to better decision making and competitive advantage. Unlocking potential business value out of this diverse and multi-structured dataset beyond the organizational boundary is a mammoth task.

We have stepped into an interconnected and intelligent digital world where the convergence of new technologies is happening fast all around. In the process, the underlying data set is growing not only in volume but also in velocity and variety. The resulting data explosion, created by a combination of mobile devices, tweets, social media, blogs, sensors and emails, demands a new kind of data intelligence.

Big data has started creating a lot of buzz across verticals, and Big data in supply chain is no different. Supply chain is one of the key focus areas that have undergone transformational changes in the recent past. Traditional supply chain applications leverage only transactional data to solve operational problems and improve efficiency. Having stepped into the Big data world, existing supply chain applications have become obsolete as they are unable to cope with tremendously increasing data volumes cutting across multiple sources, the speed with which data is generated and the unprecedented growth in new data forms. Enterprises are under tremendous pressure to solve new problems emerging out of new forms of data. Handling large volumes of data across multiple sources and deriving value out of this massive chunk for strategy execution is the biggest challenge that enterprises face in today's competitive landscape. Careful analysis and appropriate usage of these data would result in cost reduction and better operational performance. Competitive pressures and customers' more-for-less
attitudes have left enterprises with no choice other than to rethink their supply chain strategies and create differentiation. Enterprises need to adopt appropriate Big data techniques and technologies and build suitable models to derive value out of this unstructured data, and thereafter plan, schedule and route in a cost-effective manner. This paper explores the challenges that dot Big data adoption in the supply chain and proposes a value model for Big data optimization.

BIG DATA WAVE

International Data Corporation (IDC) has predicted that the Big data market will grow from $3.2 billion in 2010 to $16.9 billion by 2015 at a compound annual growth rate of 40% [2]. This shows tremendous traction towards Big data tools, technologies and platforms among enterprises. A lot of research and investment is going into how to fully tap the potential benefits hidden in Big data and derive financial value out of it. Value derived out of Big data enables enterprises to achieve differentiation by reducing cost and planning efficiently, thereby improving process efficiency.

Big data is an important asset in the supply chain which enterprises are looking to capitalize upon. They adopt different Big data analytic tools and technologies to improve their supply chain, production and customer engagement processes. The path towards operational excellence is facilitated through efficient planning and scheduling of production and logistics processes. Though supply chain data is really huge, it brings about the biggest opportunity for enterprises to reduce cost and improve their operational performance. The areas in supply chain planning where Big data can create an impact are: demand forecasting, inventory management, production planning, vendor management and logistics optimization. Big data can improve the supply chain planning process if appropriate business models are identified, designed, built and then executed. Some of its key benefits are: short time-to-market, improved operational excellence, cost reduction and increased profit margins.

CHALLENGES WITH SUPPLY CHAIN PLANNING

The success of the supply chain planning process depends on how closely demands are forecasted, inventories are managed and logistics are planned. Supply chain is the heart of an industry vertical and, if managed efficiently, drives positive business and enables sustainable advantage. With the emergence of Big data, optimizing supply chain processes has become more complicated than ever before. Handling Big data challenges in the supply chain and transforming them into opportunities is the key to corporate success. The key challenges are:

Volume - According to a McKinsey report, the number of RFID tags sold globally is projected to increase from 12 million in 2011 to 209 billion in 2021 [3]. Along with this, with the phenomenal increase in the usage of temperature sensors, QR codes and GPS devices, the underlying supply chain data generated has multiplied manifold beyond our expectations. Data flows across multiple systems and sources and hence is likely to be error-prone and incomplete. Handling such huge data volumes is a challenge.
Figure 1: Optimization Model for Improving Supply Chain Visibility - I (the Acquire stage: data sourcing from launch, customer, promotion, inventory, transportation and sensor feeds covering structured, unstructured and new data types; data extraction and cleansing across transactional (OLTP) and Big data systems such as Cascading, Hive, Pig, MapReduce, HDFS and NoSQL; and data representation). Source: Infosys Research

Velocity - Business has become highly dynamic and volatile. Changes arising due to unexpected events must be handled in a timely manner in order to avoid losing out in business. Enterprises are finding it extremely difficult to cope with this data velocity. Optimal decisions must be made quickly, and shorter processing time is the key for successful operational execution, which is lacking in traditional data management systems.

Variety - In the supply chain, data has emerged in different forms which don't fit in traditional applications and models. Structured data (transactional data), unstructured data (social data) and sensor data (temperature and RFID), along with new data types (video, voice and digital images), have created nightmares for enterprises trying to handle such diverse and heterogeneous data sets.

In today's data explosion in terms of volume, variety and velocity, handling the data alone doesn't suffice. Value creation by analyzing such massive data sets, and extraction of data intelligence for successful strategy execution, is the key.

BIG DATA IN DEMAND FORECASTING & SUPPLY CHAIN PLANNING

Enterprises use forecasting to determine how much to produce of each product type, when
and where to ship them, thereby improving supply chain visibility. An inaccurate forecast has a detrimental effect on the supply chain. Over-forecasting results in inventory pile-ups and locked working capital. Under-forecasting leads to failure in meeting demand, resulting in loss of customers and sales. Hence, in today's volatile market of unpredictable shifts in customer demand, improving the accuracy of forecasts is of paramount importance.

Data in supply chain planning has mushroomed in terms of volume, velocity and variety. Tesco, for instance, generates more than 1.5 billion new data items every month. Wal-Mart's warehouse handles some 2.5 petabytes of information, roughly equivalent to half of all the letters delivered by the US Postal Service. According to the McKinsey Global Institute report [3], leveraging Big data in demand forecasting and supply chain planning could increase profit margins by 2-3% in the Fast Moving Consumer Goods (FMCG) manufacturing value chain. This unearths a tremendous opportunity in forecasting and supply chain planning for enterprises to capitalize on the Big data deluge.

MISSING LINKS IN TRADITIONAL APPROACHES

Enterprises have started realizing the importance of Big data in forecasting and have begun investing in Big data forecasting tools and technologies to improve their supply chain, production and manufacturing planning processes. Traditional forecasting tools aren't adequate for handling huge data volumes, variety and velocity. Moreover, they are missing out on the following key aspects which improve the accuracy of forecasts:

Social Media Data As An Input: Social media is a platform that enables enterprises to collect information about potential and prospective customers. Technological advancements have made tracking customer data easier. Companies can now track every visit a customer makes to their websites, emails exchanged and comments logged across social media websites. Social media data helps analyze the customer pulse and gain insights for forecasting, planning and scheduling of supply chains and inventories. Buzz in social networks can be used as an input for demand forecasting, with numerous benefits. In one such use case, an enterprise can launch a new product to online fans to sense customer acceptance. Based on the response, inventories and the supply chain can be planned to direct stocks to high-buzz locations during the launch phase.

Predict And Respond Approach: Traditional forecasting is done by analyzing historical patterns and considering sales inputs and promotional plans to forecast demand and plan the supply chain. It focuses on what happened and works on a sense-and-respond strategy. "History repeats itself" is no longer apt in today's competitive marketplace. Enterprises need to focus on what will happen and require a predict-and-respond strategy to stay alive in business. This calls for models and systems capable of capturing, handling and analyzing huge volumes of real-time data generated from unexpected competitive events, weather patterns, point-of-sale data and
natural disasters (volcanoes, floods, etc.), and converting this data into actionable information for forecasting plans on production, inventory holdings and supply chain distribution.

Optimized Decisions with Simulations: Traditional decision support systems lack the flexibility to meet changing data requirements. In real-world scenarios, supply chain delivery plans change unexpectedly due to various reasons like demand changes, revised sales forecasts, etc. The model and system should have the ability to factor this in and respond quickly to such unplanned events. A decision should be taken only after careful analysis of the unplanned event's impact on other elements of the supply chain. Traditional approaches lack this capability, and this necessitates a model for performing what-if analysis on all possible decisions and selecting the optimal one in the Big data context.

IMPROVING SUPPLY CHAIN VISIBILITY USING BIG DATA

The supply chain doesn't lack data; what's missing is a suitable model to convert this huge, diverse raw data into actionable information so that enterprises can make critical business decisions for efficient supply chain planning. A 3-stage optimized value model helps to overcome the challenges posed by Big data in supply chain planning and demand forecasting. It bridges the existing gaps in traditional Big data approaches and offers a perspective to unlock the value from the growing Big data torrent. Designing and building an optimized Big data model for supply chain planning is a complex task, but successful execution leads to significant financial benefits. Let's take a deep dive into each stage of this model and analyze its value-add in the enterprise's supply chain planning process.

Acquire Data: The biggest driver of supply chain planning is data. Acquiring all the relevant data for supply chain planning is the first step in this optimized model. It involves three steps, namely data sourcing, data extraction and cleansing, and data representation, which make data ready for further analysis.

Data Sourcing - Data is available in different forms across multiple sources, systems and geographies. It contains extensive details of historical demand and other relevant information, and it is therefore necessary to source the required data for further analysis. Data to be sourced for improving forecast accuracy, in addition to transactional data, includes:

Product promotion data - items, prices, sales
Launch data - items to be ramped up or down
Inventory data - stock in warehouses
Customer data - purchase history, social media data
Transportation data - GPS and logistics data

Enterprises should adopt appropriate Big data systems that are capable of handling such huge data volumes, variety and velocity.
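A minimal Python sketch of this sourcing step (our illustration, not the paper's design; the feed names, field layouts and canonical record format are assumptions) shows how such heterogeneous feeds might be normalized into one analysis-friendly record format before cleansing:

# Minimal sketch: normalizing heterogeneous supply chain feeds into one
# canonical record layout before cleansing and analysis.
# Feed names and field layouts are illustrative only.

from datetime import datetime, timezone

def from_pos(row):          # structured transactional feed
    return {"source": "pos", "sku": row["sku"], "qty": row["units_sold"],
            "location": row["store_id"], "ts": row["ts"]}

def from_rfid(event):       # sensor feed
    return {"source": "rfid", "sku": event["tag_sku"], "qty": 1,
            "location": event["reader"], "ts": event["read_time"]}

def from_social(post):      # unstructured feed, reduced to a demand signal
    return {"source": "social", "sku": post.get("mentioned_sku"), "qty": None,
            "location": post.get("geo"), "ts": post["created_at"]}

ADAPTERS = {"pos": from_pos, "rfid": from_rfid, "social": from_social}

def acquire(feed_name, payloads):
    """Yield canonical records, stamping acquisition time for lineage."""
    adapter = ADAPTERS[feed_name]
    for payload in payloads:
        record = adapter(payload)
        record["acquired_at"] = datetime.now(timezone.utc).isoformat()
        yield record

feeds = {
    "pos":    [{"sku": "A12", "units_sold": 3, "store_id": "S01", "ts": "2013-01-05T10:00"}],
    "rfid":   [{"tag_sku": "A12", "reader": "DC-7", "read_time": "2013-01-05T09:58"}],
    "social": [{"mentioned_sku": "A12", "geo": "Chicago", "created_at": "2013-01-04T21:12"}],
}

for name, payloads in feeds.items():
    for rec in acquire(name, payloads):
        print(rec)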
Data Extraction and Cleansing - Data sources range from structured (transactional data) to unstructured (social media, images, sensor data, etc.), and they are not in analysis-friendly formats. Also, due to the large volume of heterogeneous data, there is a high probability of inconsistencies and data errors while sourcing. The sourced data should be expressed in structured form for supply chain planning. Moreover, analyzing inaccurate and untimely data leads to erroneous, non-optimal results. High-quality and comprehensive data is a valuable asset, and appropriate data cleansing mechanisms should be in place for maintaining the quality of Big data. The choice of Big data tools for data cleansing and enrichment plays a crucial role in supply chain planning.

Data Representation - Database design for such huge data volumes is a herculean task and poses serious performance issues if not executed properly. Data representation plays a key role in Big data analysis. There are numerous ways to store data, and each design has its own set of advantages and drawbacks. Selecting and executing an appropriate database design favoring business objectives reduces the effort in reaping benefits out of Big data analysis in supply chain planning.

Analyze Data: The next stage is analyzing the cleansed data and capturing value for forecasting and supply chain planning. There is a plethora of Big data techniques available in the market for forecasting and supply chain planning. The selection of a Big data technique depends on the business scenario and enterprise objectives. Incompatible data formats make value creation from Big data a complex task, and this calls for innovation in techniques to unlock business value out of the growing Big data torrent. The proposed model adopts an optimization technique to generate insights out of this voluminous and diverse Big dataset.

Optimization in Big data analysis - Manufacturers have started synchronizing forecasting with production cycles, so the accuracy of forecasting plays a crucial role in their success. Adopting an optimization technique in Big data analysis creates a new perspective and helps improve the accuracy of demand forecasting and supply chain planning. Analyzing the impact of promotions on one specific product for demand forecasting appears to be an easy task. But real-life scenarios comprise a huge army of products, with the factors affecting demand varying for every product and location, making data analysis difficult for traditional techniques. The optimization technique has several capabilities which make it an ideal choice for data analysis in such scenarios. Firstly, this technique is designed for analyzing and drawing insights from highly complex systems with huge data volumes and multiple constraints and factors to be accounted for. Secondly, supply chain planning has a number of enterprise objectives associated with it, like cost reduction, demand fulfillment, etc. The impact of each of these objective measures on the enterprise's profitability can be easily analyzed using the optimization technique.
Figure 2: Optimization Model for Improving Supply Chain Visibility - II (the Acquire stage feeding an optimization technique with input data, goals such as minimizing cost and maximizing profit and demand coverage, and capacity, route and demand-coverage constraints; outputs are inventory, demand and logistics plans, followed by an Achieve stage with scenario management, multi-user collaboration and performance trackers such as KPI dashboards and actual vs. planned comparisons). Source: Infosys Research

The flexibility of the optimization technique is another benefit that makes it suitable for Big data analysis to uncover new data connections and turn them into insights. The optimization model comprises four components, viz., (i) input: consistent, real-time, quality data which is sourced, cleansed and integrated becomes the input of the optimization model; (ii) goals: the model should take into consideration all the goals pertaining to forecasting and supply chain planning, like minimizing cost, maximizing demand coverage, maximizing profits, etc.; (iii) constraints: the model should incorporate all the constraints specific to supply chain planning, such as minimum inventory in the warehouse, capacity constraints, route constraints, demand coverage constraints, etc.; and (iv) output: results based on the input, goals and constraints defined in the model that can be used for strategy execution, such as a demand plan, inventory plan, production plan or logistics plan.
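To illustrate how these four components fit together, here is a minimal linear programming sketch (ours, not the paper's model; it assumes SciPy is available and uses invented costs, demands and capacities) that takes input data, a cost-minimization goal and capacity and demand-coverage constraints, and returns a shipment plan as output:

# Minimal sketch of the four-component optimization model:
# input data, a cost-minimization goal, capacity/demand-coverage
# constraints, and a logistics plan as output.
# Requires SciPy; all figures are made up for illustration.

from scipy.optimize import linprog

# INPUT: unit shipping costs for (warehouse, store) pairs
#        W1->S1, W1->S2, W2->S1, W2->S2
cost = [4, 6, 5, 3]

# CONSTRAINTS (expressed as A_ub @ x <= b_ub):
#   demand coverage per store (>= demand, so negated) and capacity per warehouse
A_ub = [
    [-1,  0, -1,  0],   # shipments into S1 >= 80
    [ 0, -1,  0, -1],   # shipments into S2 >= 70
    [ 1,  1,  0,  0],   # warehouse W1 capacity <= 100
    [ 0,  0,  1,  1],   # warehouse W2 capacity <= 90
]
b_ub = [-80, -70, 100, 90]

# GOAL: minimize total shipping cost; OUTPUT: the shipment plan x
result = linprog(c=cost, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 4)

if result.success:
    labels = ["W1->S1", "W1->S2", "W2->S1", "W2->S2"]
    print("shipment plan:", dict(zip(labels, result.x)))
    print("total cost:", result.fun)

A production model would of course carry many more variables and constraints (minimum warehouse inventory, routes, service levels) and would typically rely on specialized or distributed solvers.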
Choice of Algorithm: One of the key differentiators in supply chain planning is the algorithm used in modeling. Optimization problems have numerous possible solutions, and the algorithm should have the capability to fine-tune itself to achieve optimal solutions.

Achieve Business Objective: The final stage in this model is achieving business objectives through demand forecasting and supply chain planning. It involves three steps which facilitate enterprises in supply chain decisions.

Scenario Management - Business events are difficult to predict and most of the time deviate from their standard paths, resulting in unexpected behaviors and events. This makes planning and optimizing difficult during uncertain times. Scenario management is the approach to overcome such uncertain situations. It facilitates creating business scenarios, comparing multiple scenarios, and analyzing and assessing their impact before making decisions. This capability helps to balance conflicting KPIs and arrive at an optimal solution matching business needs.

Multi-User Collaboration - An optimization model in a real business case comprises highly complex data sets and models, and requires support from an army of analysts to determine its effects on the enterprise's goals. A combination of technical and domain experts is required to obtain optimal results. To achieve near-accurate forecasts and supply chain optimization, the model should support multi-user collaboration so that multiple users can collaboratively produce optimal plans and schedules and re-optimize as and when the business changes. This model builds a collaborative system capable of supporting inputs from multiple users and incorporating them in its decision-making process.

Performance Tracker - Demand forecasting and supply chain planning do not follow a build-model-execute approach; they require significant continuous effort. Frequent changes in the inputs and business rules necessitate monitoring of data, model and algorithm performance. Actual and planned results are to be compared regularly and steps taken to minimize deviations in accuracy. KPIs are to be defined, and dashboards should be constantly monitored for model performance.

KEY BENEFITS

Enterprises can accrue a lot of benefits by adopting this 3-stage model for Big data analysis. Some of them are detailed below:

Improves Accuracy of Forecasts: One of the key objectives of forecasting is profit maximization. This model adopts effective data sourcing, cleansing and integration systems and makes data ready for forecasting. The inclusion of social media data, promotional data, weather predictions and seasonality, in addition to historical demand and sales histories, adds value and improves forecasting accuracy. Moreover, the optimization technique for Big data analysis reduces forecasting errors to a great extent.

Continuous Improvement: The Acquire-Analyze-Achieve model is not a hard-wired model. It allows flexibility to fine-tune and supports what-if analysis. Multiple scenarios can be
created, compared and simulated to identify the impact of change on the supply chain and demand forecasting prior to making any decisions. It also enables enterprises to define, track and monitor KPIs from time to time, resulting in continuous process improvements.

Better Inventory Management: Inventory data, along with weather predictions, sales history and seasonality, is considered as an input to the model for forecasting and planning the supply chain. This approach minimizes incidents of out-of-stocks or over-stocks across different warehouses. An optimal plan for inventory movement is forecasted and appropriate stocks are maintained at each warehouse to meet upcoming demand. To a great extent this reduces loss of sales and business due to stockouts and leads to better inventory management.

Logistics Optimization: Constant sourcing and continuous analysis of transportation data (GPS and other logistics data), and using them for demand forecasting and supply chain planning through optimization techniques, helps in improving distribution management. Moreover, optimization of logistics improves fuel efficiency and routing of vehicles, resulting in operational excellence and better supply chain visibility.

CONCLUSIONS

As the rapid penetration of information technology in supply chain planning continues, the amount of data that can be captured, stored and analyzed has increased manifold. The challenge is to derive value out of these large volumes of data by unlocking financial benefits congruent with the enterprise's business objectives. Competitive pressures and customers' more-for-less attitude have left enterprises with no option other than reducing cost in their operational executions. Adopting effective and efficient supply chain planning and optimization techniques to match customer expectations with their offerings is the key to corporate success. To attain operational excellence and sustainable advantage, it is necessary for the enterprise to build innovative models and frameworks leveraging the power of Big data. The optimized value model on Big data offers a unique way of demand forecasting and supply chain optimization through collaboration, scenario management and performance management. This model of continuous improvement opens up doors for big opportunities in the next generation of demand forecasting and supply chain optimization.

REFERENCES
1. IDC Press Release (2012), IDC Releases First Worldwide Big Data Technology and Services Market Forecast, Shows Big Data as the Next Essential Capability and a Foundation for the Intelligent Economy. Available at jsp?containerid=prus
2. McKinsey Global Institute (2011), Big Data: The next frontier for innovation, competition, and productivity. Available at McKinsey/dotcom/Insights%20and%20pubs/mgi/research/technology%20and%20innovation/big%20data/mgi_big_data_full_report.ashx.
3. Furio, S., Andres, C., Lozano, S., Adenso-Diaz, B. (2009), Mathematical model to optimize land empty container movements. Available at
Articles/doc/presentations/HMS2009_Paperid_27_Furio.aspx.
4. Stojković, G., Soumis, F., Desrosiers, J., Solomon, M. (2001), An optimization model for a real-time flight scheduling problem. Available at sciencedirect.com/science/article/pii/S
5. Beck, M., Moore, T., Plank, J., Swany, M. (2000), Logistical Networking. Available at pdf/logisticalnetworking.pdf.
6. Lasschuit, W., Thijssen, N. (2004), Supporting supply chain planning and scheduling decisions in the oil and chemical industry, Computers and Chemical Engineering, issue 28. Available at com/aimms/download/case_studies/shell_elsevier_article.pdf.
Retail Industry Moving to Feedback Economy
By Prasanna Rajaraman and Perumal Babu

Gain better insight into customer dynamics through Big data analytics.

The retail industry is going through a major paradigm shift. The past decade has seen unprecedented churn in the retail industry, virtually changing the landscape. Erstwhile marquee brands from the traditional retailing side have ceded space to start-ups and new business models. The key driver of this change is a confluence of technological, sociological and customer behavioral trends creating this strategic inflection point in the retailing ecology. Trends like the emergence of the internet as a major retailing channel, social platforms going mainstream, pervasive retailing and the emergence of the digital customer have presented a major challenge to traditional retailers and retailing models.

On the other hand, these trends have also enabled opportunities for retailers to better understand customer dynamics. For the first time, retailers have access to an unprecedented amount of publicly available information on customer behavior and trends, voluntarily shared by customers. The more effective retailers can tap into these behavioral and social reservoirs of data to model the purchasing behaviors and trends of their current and prospective customers. Such data can also provide retailers with predictive intelligence which, if leveraged effectively, can create enough mindshare that the sale is completed even before the conscious decision to purchase is taken. This move to a feedback economy, where retailers can have a 360-degree view of the customer thought process across the selling cycle, is a paradigm shift for the retail industry: from the retailer driving sales to the retailer engaging the customer across the sales and support cycle. Every aspect of retailing, from assortment/allocation planning and marketing/promotions to customer interactions, has to take the evolving consumer trends into consideration.

The implication from a business perspective is that retailers have to better understand customer dynamics and align
business processes effectively with these trends. In addition, this implies that cycle times will be shorter and businesses have to be more tactical in their promotions and offerings. Retailers who can ride this wave will be better able to address demand and command higher margins for their products and services. Failing this, retailers will be left with the low-margin pricing/commodity space.

From an information technology perspective, the key challenge is that the nature of this information, with respect to lifecycle, velocity, heterogeneity of sources and volume, is radically different from what traditional systems handle. Also, there are overarching concerns like data privacy, compliance and regulatory changes that need to be internalized within internal processes. The key is to manage the lifecycle of this Big data, effectively integrate it with the organizational system and derive actionable information.

TOWARDS A FEEDBACK ECONOMY

Customer dynamics refers to the customer-business relationships that describe the ongoing interchange of information and transactions between customers and organizations, going beyond the transactional nature of the interaction to look at emotions, intent and desires. Retailers can create significant competitive differentiation by understanding the customer's true intent in a way that also supports the business intents [1, 2, 3, 4].

Figure 1: The OODA loop (Observe, Orient, Decide, Act): outside information, unfolding circumstances and unfolding interaction with the environment feed observation; analysis and synthesis, shaped by cultural traditions, genetic heritage, previous experiences and new information, orient the decision (hypothesis) and action (test); feedback closes the loop. Source: Reference [5]

John Boyd, a colonel and military strategist in the US Air Force, developed the OODA loop (Observe, Orient, Decide and Act), which he used for combat operations. Today's business environment is no different: retailers are battling to get customers into their shops (physical or net-front) and convert their visits to sales, and understanding customer dynamics plays a key role in this effort. The OODA loop explains the crux of the feedback economy.
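To make the loop concrete in a retail setting, the following minimal Python sketch (ours; the observation source, tuning rule and sales-lift response are invented) runs an observe-orient-decide-act cycle in which each pass feeds its outcome back into the next promotion decision:

# Minimal sketch of an OODA-style feedback loop applied to a promotion decision.
# The observations, tuning rule and response are invented for illustration.

import random

random.seed(7)

def observe():
    """Observe: outside information, here a crude customer-sentiment reading in [-1, 1]."""
    return random.uniform(-1, 1)

def orient(new_information, experience):
    """Orient: analysis and synthesis of new information with previous experience."""
    experience.append(new_information)
    return sum(experience) / len(experience)

def decide(outlook, discount):
    """Decide: the hypothesis - deepen the discount when the blended outlook is weak."""
    return min(0.30, discount + 0.05) if outlook < 0 else max(0.0, discount - 0.05)

def act(discount):
    """Act: test the hypothesis and measure the outcome (a noisy sales lift)."""
    return 1.0 + 2.0 * discount + random.uniform(-0.1, 0.1)

experience, discount = [], 0.10
for cycle in range(5):
    outlook = orient(observe(), experience)
    discount = decide(outlook, discount)
    lift = act(discount)
    experience.append(lift - 1.0)   # feedback: the measured outcome shapes the next orientation
    print(f"cycle {cycle}: outlook={outlook:+.2f} discount={discount:.2f} lift={lift:.2f}")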
In a feedback economy, there is constant feedback to the system from every phase of its execution. Along with this, the organization should observe the external environment, unfolding circumstances and customer interactions. These inputs are analyzed and action is taken based on them. This cycle of adaptation and optimization makes the organization more efficient and effective on an ongoing basis. Leveraging this feedback loop is pivotal to a proper understanding of customer needs and wants and the evolving trends. In today's environment, this means acquiring data from heterogeneous sources, viz., in-store transaction history, web analytics, etc. This creates a huge volume of data that has to be analyzed to get the required actionable insights.

BIG DATA LIFECYCLE: ACQUIRE-ANALYZE-ACTIONIZE

The lifecycle of Big data can be visualized as a three-phased approach resulting in continuous optimization. The first step in moving towards the feedback economy is to acquire data. In this case, the retailer should look into macro and micro environment trends and consumer behavior - likes, emotions, etc. Data from electronic channels like blogs, social networking sites and Twitter will give the retailer a humongous amount of data regarding the consumer. These feeds help the retailer understand consumer dynamics and give more insight into her buying patterns. The key advantage of plugging into these disparate sources is the sheer information one can gather about the customer, both individually and in aggregate. On the other hand, Big data is materially different from the data retailers are used to handling. Most of it is unstructured (from blogs, Twitter feeds, etc.) and cannot be directly integrated with traditional analytics tools, leading to challenges in how the data can be assimilated with backend decision-making systems and analyzed.

In the assimilate/analyze phase, the retailer must decide which data is of use and define rules for filtering out the unwanted data. Filtering should be done with utmost care, as there are cases where indirect inferences are possible. The data available to the retailer after the acquisition phase will be in multiple formats, and it has to be cleaned and harmonized with the backend platforms. Cleaned-up data is then mined for actionable insights.

Actionize is the phase where the insights gathered from the analyze phase are converted into actionable business decisions by the retailer. The response, i.e., the business outcome, is fed back to the system so that the system can self-tune on an ongoing basis, resulting in a self-adaptive system that leverages Big data and feedback loops to offer business insight more customized than what would traditionally be possible. It is imperative to understand that this feedback cycle is an ongoing process and not to be considered a one-stop solution for the analytics needs of a retailer.

ACQUIRE: FOLLOWING CUSTOMER FOOTPRINTS

To understand the customer, retailers have to leverage every interaction with the customer and tap into the sources of customer insight. Traditionally, retailers have relied primarily on in-store customer interactions and associated transaction data, along with specialized campaigns like opinion polls, to gain better insight into customer dynamics. While this interaction looks limited, a recent incident shows how powerfully customer sales history can be leveraged to gain predictive intelligence on customer needs.
The father of a teenage girl called a major North American retailer to complain that the retailer had mailed coupons for child care products addressed to his underage daughter. A few days later, the same father called back and apologized: his daughter was indeed pregnant and he had not been aware of it earlier [6]. Surprisingly, by all indications, only in-store purchase data was mined by the retailer in this scenario to identify the customer need, which in this case was for childcare products.

To exploit the power of the next generation of analytics, retailers must plug into data from non-traditional sources like social sites, Twitter feeds, environment sensor networks, etc. to have better insight into customer needs. Most major retailers now have multiple channels: brick-and-mortar store, online store, mobile apps, etc. Each of these touch points not only acts as a sales channel but can also generate data on customer needs and wants. Coupling this information with other repositories like Facebook posts, Twitter feeds (i.e., sentiment analysis) and web analytics gives retailers the opportunity to track customer footprints both in and outside the store and to customize their offerings and interactions with the customer.

Traditionally, retailers have dealt with voluminous data. For example, Wal-Mart logs more than 2.5 petabytes of information about customer transactions every hour, equivalent to 167 times the books in the Library of Congress [7]. However, the nature of Big data is materially different from traditional transaction data, and this must be considered while data planning is done. Further, while data is readily available, the legality and compliance aspects of gathering and using data are additional considerations. Integrating information from multiple sources can result in generating data that is beyond what the user originally consented to, potentially resulting in liability for the retailer. Given that most of this information is accessible globally, retailers should ensure compliance with local regulations (EU data/privacy protection regulations, HIPAA for US medical data, etc.) wherever they operate.

ANALYZE - INSIGHTS (LEADS) TO INNOVATION

Analyst Doug Laney defined data growth challenges and opportunities as being three-dimensional, i.e., increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources) [9]. The key to acquiring Big data is to handle these dimensions while assimilating the aforementioned external sources of data.

To understand how Big data analytics can enrich and enhance a typical retail process, allocation planning, let's look at the allocation planning case study of a major North American apparel retailer. The forecasting engine used for the planning process uses statistical algorithms to determine allocation quantities. Key inputs to the forecasting engine are the sales history and current performance of a store. In addition, adjustments are made based on parameters like promotional events (including markdowns), current stock levels and back orders to determine the inventory that needs to be shipped to a particular store. While this is fairly in line with the industry standard for allocation forecasting, Big data can enrich this process by including additional parameters that can impact demand. For example, a news piece on a town's go-green initiative or no-plastic day can be taken as an additional adjustment parameter for non-green items in that area. Similarly, a weather forecast of a warm front in an area can automatically trigger a reduction of stocks of warm clothing for stores there.
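A minimal Python sketch of such enrichment (ours, not the retailer's actual engine; the baseline quantities, item tags and adjustment factors are invented) applies external-signal multipliers on top of a baseline allocation forecast:

# Minimal sketch: enriching a baseline allocation forecast with external
# Big data signals. Adjustment factors and item attributes are invented.

BASELINE = {                       # statistical forecast per (store, sku)
    ("S01", "plastic_cutlery"): 400,
    ("S01", "winter_jacket"):   120,
    ("S02", "winter_jacket"):   150,
}

ITEM_TAGS = {"plastic_cutlery": {"non_green"}, "winter_jacket": {"warm_clothing"}}

EXTERNAL_SIGNALS = [               # mined from news feeds, weather services, etc.
    {"store": "S01", "signal": "go_green_initiative", "tag": "non_green",     "factor": 0.7},
    {"store": "S01", "signal": "warm_front_forecast", "tag": "warm_clothing", "factor": 0.6},
]

def adjusted_allocation(baseline, signals, item_tags):
    """Scale the baseline forecast wherever an external signal targets the item's tags."""
    plan = dict(baseline)
    for (store, sku), qty in baseline.items():
        for s in signals:
            if s["store"] == store and s["tag"] in item_tags.get(sku, set()):
                plan[(store, sku)] = round(plan[(store, sku)] * s["factor"])
    return plan

for key, qty in adjusted_allocation(BASELINE, EXTERNAL_SIGNALS, ITEM_TAGS).items():
    print(key, qty)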
an area can automatically trigger reduction of stocks of warm-clothing for stores there. A high-level logical view of a Big data implementation is explained below to further the understanding of how Big data can be assimilated with traditional data sources. The data feeds for the implementation come from structured sources like forums, feedback forms and rating sites, unstructured sources like the social web, as well as semi-structured data from emails, word documents, etc. This is a veritable data feast compared to what traditional systems have been fed, but it is important to diet on such data and use only those feeds that create optimum value. This is done through a synergy of business knowledge and processes specific to the retailer and the industry segment the retailer operates in, and a set of tools specialized in analyzing huge volumes of data at rapid speed. Once data is massaged for downstream systems, Big data analytics tools are used to analyze it. Based on business needs, real-time or offline data processing/analytics can be used. In real-life scenarios, both these approaches are used based on situation and need. Proper analysis needs data not just from consumer insight sources but also from transactional history and consumer profiles.

ACTIONIZE BIG DATA TO BIG IDEAS

This is the key part of the Big data cycle. Even the best data cannot substitute for timely action. The technology and functional stack will facilitate the retailer in getting proper insight into key customer intent on purchase: what, where, why and at what price.

Best Sellers in Tablet PCs
1. Kindle Fire HD 7", Dolby Audio, Dual-Band Wi-Fi, 32GB
2. Kindle Fire HD 8.9", Dolby Audio, Dual-Band Wi-Fi, 16GB
3. Samsung Galaxy Tab 2 (7-Inch, Wi-Fi)
4. Samsung Galaxy Tab 2 (10.1-Inch, Wi-Fi)
5. Kindle Fire HD 8.9", Dolby Audio, Dual-Band Wi-Fi, 32GB

Most Wished For in Tablet PCs
1. Kindle Fire HD 8.9", 4G LTE Wireless, Dolby Audio, Dual-Band Wi-Fi, 32GB
2. Kindle Fire HD 8.9", Dolby Audio, Dual-Band Wi-Fi, 32GB
3. Kindle Fire, Full Color 7" Multi-touch Display, Wi-Fi
4. Kindle Fire HD 7", Dolby Audio, Dual-Band Wi-Fi, 32GB
5. Samsung Galaxy Tab 2 (7-Inch, Wi-Fi)

Figure 2: Correlation between Customer Ratings and Sales Source: Reference [12]

By knowing this,
26 the retailer can customize the 4Ps (product, pricing, promotions and place) to create enough mindshare from customer perspective that sales become inevitable [10]. For example, a cursory look at random product category (tablet) in an online retailer site shows the strong correlation between customer ratings and sales, i.e., 4 out of 6 best user-rated products are in the top five in sales a 60% correlation even when other parameters like brand, price, release date are not taken into consideration [Fig. 2] 12. The retailer knowing the customer ratings can offer promotions that can tip the balance between sales and lost opportunity. While this example may not be the rule, the key to analysis and actionizing the data is to correlate the importance of user feedback data and concomitant sales. BIG DATA OPPORTUNITIES The implication of Big data analytics on major retailing processes will be along the following areas. Identifying the Product Mix: The assortment and allocation will need to take into consideration the evolving user trends identified from Big data analytics to ensure the offering matches the market needs. Allocation planning especially has to be tactical with shorter lead times. Promotions and Pricing: Retailers have to move from generic pricing strategies to customized user specific. Communication with Customer: Advertising will move from mass media to personalized communication; from one way to two-way communication. Retailers will gain more from viral marketing [12] than from traditional advertising channels. Compliance: Governmental regulations and compliance requirements are mandatory to avoid liability as comingling data from disparate sources can result in generation of personal data beyond the scope of the original user s intent. While data is available globally, the use has to comply with local law of the land and ensure it is done keeping in mind customer s sensibilities. People, Process and Organizational Dynamics: The move to feedback economy requires different organizational mindset and processes. Decision making will need to be more bottom-up and collaborative. Retailers need to engage customer to ensure the feedback loop is in place. Further, Big data being crossfunctional, needs the active participation and coordination between various departments in the organization; hence managing organizational dynamics is the key consideration. Better Customer Experience: Organizations can improve the overall customer experience by providing updates services and thereby eliminating surprises. For instance Big data solutions can be used to pro-actively inform customers of expected shipment delays based on traffic data, climate and other external factors. BIG DATA ADOPTION STRATEGY Presented below is a perspective on how to adopt a Big data solution within the enterprise. 24
27 Define Requirements, Scope and Mandate: Define mandate and objective in terms of what is the required from Big data solution. A guiding factor to identify the requirements would be the prioritized list of business strategies. As part of initiation, it is important to also identify the goal and KPIs that vindicates the usage of Big data. Key Player: Business Key Player: Data Analyst Strategy to Actionize the Insights: Business should create process that would take these inferences as inputs to decision making. Stakeholders in decision making should be identified and actionable inferences have to be communicated at the right time. Speed is critical to the success of Big data. Choosing the Right Data Sources: Once the requirement and scope is defined, the IT department has to identify the various feeds that would fetch the relevant data. These feeds would be structured, semi structured and unstructured. The source could be internal or external. For internal sources, the policies and processes should be defined to enable friction less flow of data. Key Players: IT and Business Choosing the Required Tools and Technologies: After deciding upon the sources of data that would feed the system, the right tools and technology should be identified and aligned with business needs. Key areas are capturing the data, tools and rules to clean the data, identify tools for real-time and offline analytic, identify storage and other infrastructure needs. Key Player: IT Creating Inferences from Insights: One of the key factors to a successful Big data implementation is to have a pool of talented data analyst who can create proper inferences from the insights and facilitate build and definition of new analytic models. These models help in probing the data and understand the insights. Key Player: Business Measuring the Business Benefits: The success of the Big data initiative depends on the value it creates to the organization and its decision making body. It should also be noted that unlike other initiatives, Big data initiatives are usually continuous process in search of the best results. Organizations should be in tune to this understanding to derive the best results. However, it is important that a goal is set and measured to track the initiative and ensure its movement in the right direction. Key Players: IT and Business CONCLUSION The move to feedback economy presents an inevitable paradigm shift for the retail industry. Big data as the enabling technology will play key role in this transformation. As ever, business needs will continue to drive technology process and solution. However, given the criticality of Big data, organizations will need to treat Big data as an existential strategy and make the right investment to ensure they can ride the wave. REFERENCES 1. Customer dynamics. Available at en.wikipedia.org/wiki/customer_ dynamics. 25
2. Davenport, T. and Harris, G. (2007), Competing on Analytics, Harvard Business School Publishing.
3. DeBorde, M. (2006), Do Your Organizational Dynamics Determine Your Operational Success?, The O&P Edge.
4. Lemon, K., White, T. B. and Winer, R. S., Dynamic Customer Relationship Management: Incorporating Future Considerations into the Service Retention Decision, Journal of Marketing.
5. Boyd, J. (September 3, 1976), OODA loop, in Destruction and Creation. Available at en.wikipedia.org/wiki/OODA_loop.
6. Doyne, S. (2012), Should Companies Collect Information About You?, NY Times. Available at blogs.nytimes.com/2012/02/21/should-companies-collect-information-about-you/.
7. Data, data everywhere (2010), The Economist. Available at economist.com/node/.
8. IDC Digital Universe (2011). Available at chucks_blog/2011/06/2011-idc-digital-universe-study-big-data-is-here-now-what.html.
9. Gartner Says Solving Big Data Challenge Involves More Than Just Managing Volumes of Data (2011). Available at gartner.com/it/page.jsp?id=.
10. Gens, F. (2012), IDC Predictions 2012. Available at Predictions12/Main/downloads/IDCTOP10Predictions2012.pdf.
11. Bhasin, H., 4Ps of Marketing. Available at marketing-mix-4-ps-marketing/.
12. Amazon US site / tablets category (2012). Available at gp/top-rated/electronics/ref=zg_bs_tab_t_tr?pf_rd_p=&pf_rd_s=right-8&pf_rd_t=2101&pf_rd_i=list&pf_rd_m=atvpdkikx0der&pf_rd_r=14ywr6hbvr6xas7wd2gg.
13. Godwin, G. (2008), Viral Marketing. Available at blog/2008/12/what-is-viral-m.html.
14. Wang, R. (2012), Monday's Musings: Beyond The Three V's of Big Data - Viscosity and Virality. Available at softwareinsider.org/2012/02/27/mondays-musings-beyond-the-three-vs-of-big-data-viscosity-and-virality/.
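To make the allocation-planning enrichment discussed in the Analyze section above more concrete, the sketch below folds an external signal (a warm-front forecast or a local go-green initiative) into a simple baseline forecast. The signal names, adjustment weights and the baseline formula are hypothetical illustrations, not any retailer's actual forecasting engine.

```python
# Hypothetical sketch: adjusting a store-level allocation forecast with
# external Big data signals (weather, local news). All names and weights
# below are illustrative assumptions.

def base_forecast(sales_history, current_stock, back_orders, promo_uplift=1.0):
    """Naive statistical baseline: recent average demand, adjusted for promotions."""
    avg_weekly_sales = sum(sales_history[-4:]) / 4.0
    needed = avg_weekly_sales * promo_uplift + back_orders - current_stock
    return max(needed, 0)

# External signals mined from unstructured sources (news feeds, weather data).
# Each maps a (category, signal) pair to a demand multiplier -- assumed values.
SIGNAL_ADJUSTMENTS = {
    ("warm_clothing", "warm_front_forecast"): 0.6,    # cut warm-clothing stock
    ("non_green_items", "go_green_initiative"): 0.8,  # soften non-green demand
}

def adjusted_allocation(category, sales_history, current_stock,
                        back_orders, active_signals, promo_uplift=1.0):
    qty = base_forecast(sales_history, current_stock, back_orders, promo_uplift)
    for signal in active_signals:
        qty *= SIGNAL_ADJUSTMENTS.get((category, signal), 1.0)
    return round(qty)

if __name__ == "__main__":
    qty = adjusted_allocation(
        category="warm_clothing",
        sales_history=[120, 110, 130, 125],
        current_stock=40,
        back_orders=10,
        active_signals=["warm_front_forecast"],
    )
    print(f"Units to ship: {qty}")
```

In practice the adjustment factors would themselves be learned from historical sales and signal data rather than fixed by hand.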
29 Infosys Labs Briefings VOL 11 NO Harness Big Data Value and Empower Customer Experience Transformation By Zhong Li PhD Communication Service Providers need to leverage the 3M Framework with a holistic 5C process to extract Big Data value (BDV) In today s hyper-competitive experience economy, communication service providers (CSPs) recognize that product and price alone will not differentiate their business and brand. Since brand loyalty, retention and long-term profitability are now so closely aligned with customer experience, the ability to understand customers, spot changes in their behavior and adapt quickly to new consumer needs is fundamental to the success of the consumer driven Communication Service Industry. The increasingly sophisticated digital consumers demand more personalized services through the channel of their choice. In fact, the internet, mobile and particularly, the rise of social media in the past 5 years have empowered consumers more than ever before. There is a growing challenge for CSPs that are contending with an increasingly scattered relationship with customers who can now choose from multiple channels to conduct business interactions. A recent industry research indicates that some 90% of today s consumers in the US and West Europe interact across multiple channels, representing a moving target that makes achieving a full view of the customer that much more challenging. To compound this trend, the always-on digital customers continuously create more data in various types, from many more touch points with more interaction options. CSPs encounter Big data phenomenon by accumulating significant amounts of customer related information such as purchase patterns, activities on the website, from mobile, social media or interactions with the network and call centre. Such Big data phenomenon presents CSPs with challenges along 3V dimensions (Fig. 1), viz., 27
30 Web Store Large Volume: Recent industry research shows that the amount of data that the CSP has to manage with consumer transaction and interaction has doubled in the past three years, and its growth is also in acceleration to double the size again in the next two years, much of it coming from new sources including blogs, social media, internet search, and networks [7]. Broad Variety: The type, form and format of data are created in a broad variety. Data is created from multiple channels such as online, call centre, stores and social media including Facebook, Twitter and other social media platforms. It presents itself in a variety of types, comprising structured data form transaction, semi-structure data from call records and unstructured data in multimedia forms from social interactions Rapidly Changing Velocity: The always on digital consumers create change dynamics of data in the speed of light. They equally demand fast response from CSPs to satisfy their personalized needs in real time. Volume Mobile Variety Value Velocity Social Call centre Figure 1: Big Data in 3Vs is accumulated from Multiple Channels Source: Infosys Research CSPs of all sizes have learned the hard way that it is very difficult to take full advantage of all of the customer interactions in Big data if they do not know what their customers are demanding or what their relative value to the business is. Even some CSPs that do segment their customers with the assistance of customer relationship management (CRM) system struggle to take complete advantage of that segmentation in developing a real-time value strategy. In hypersophisticated interaction patterns throughout their journey spanning marketing, research, order, service and retention, Big data sheds shining light to expose treasured customer intelligence along aspects of 4Is viz., interest, insight, interaction and intelligence. Interest and Insight: Customers offer their attention for interest and share their insights. They visit a web site, make a call, or access a retail store, share view on social media because they want something from CSP at that moment information about a product or help with a problem. These interactions present an opportunity for the CSP to communicate with a customer who is engaged by choice and ready to share information regarding her personalized wants and needs. Interaction and Intelligence: It is typically crucial for CSPs to target offerings to particular customer segments based on the intelligence of customer data. The success of these real time interactions whether through online, mobile, social media, or other channels depends to a great extent on the CSP s understanding of the customer s wants and needs at the time of the interaction. 28
31 Therefore, alongside managing and securing Big data in 3V dimensions, CSPs are facing a Correlate fundamental challenge on how to explore and harness Big data Value (BDV). Converge Product Control A HOLISTIC 5C PROCESS TO HARNESS BDV Rising to the challenges and leveraging on the opportunity in Big data, CSPs need to harness BDV with predictive models to provide deeper insight into customer intelligence from profiles, behaviours and preferences that are hidden in Big data of vast volume and broad variety, and to deliver superior personalized experience with fast velocity in real time throughout entire customer journey. In the past decade, most CSPs have invested significant amount of efforts in the implementation of complex CRM systems to manage customer experience. While those CRM systems bring efficiency in helping CSPs to deliver on what to do in managing historical transactions, they lack the crucial capability of defining how to act in time with the most relevant interaction to maximize the value for the customer. CSPs now need to look beyond what CRM has to offer and dive deeper to cover how to do things right for the customer by capturing customers subjective sentiment in a particular interaction, resultant insight into predication on what a customers demand from CSPs and trigger proactive action to satisfy their needs, which is more likely to lead to customer delight and ultimate revenues. To do so, CSPs needs to execute a holistic 5C process, i.e., collect, converge, correlate, collaborate and control, in extracting BDV (Fig. 2). Promotion Collect The holistic 5C process will help CSPs to aggregate the whole interaction with a customer across time and channels, support with large volume and broad variety of data including promotion, product, order and services, define interactions with that of customer s preferences. The context of the customer s relationship with the CSP, and actual and potential value that she derives, in particular, determine the likelihood that she consumer will take particular actions based on real time intelligence. Big data can help the CSP correlate the customer s needs with product, promotion, order, service and deliver the right offer at the right time in the appropriate context that she is most likely to respond to. AN OVERARCHING 3M FRAMEWORK TO EXTRACT BDV Customer Service Order Collaborate Figure 2: Harness BDV with a Holistic 5C Process Source: Infosys Research To execute a holistic 5C process for Big data, CSPs need to implement an overarching framework that integrates the various pools of customer related data residing in CSPs enterprise systems, create an actionable customer profile, deliver insight based on that profile in real time customer interaction event and effectively match sales and service resources to take proactive actions, so as to monetize ultimate value on the fly. 29
32 The overarching framework needs to incorporate 3M modules, i.e. Model, Monitor and Mobilize Model Profile: It models customer profile based on all the transactions that helps CSPs gain insight at the individualcustomer level. Such a profile requires not only integration of all customer facing systems and enterprise systems, but integration with all the customer interactions such as , mobile, online and social in enterprise systems such as OMS, CMS, IMS and ERP in parallel with CRM paradigm, and model an actionable customer profile to be able to effectively deploy resources for a distinct customer experience. Monitor Pattern: It monitors customer interaction events from multiple touch points in real time, dynamically senses and triggers matching patterns of events with the defined policies and set models, and makes suitable recommendations and offers at right time through an appropriate channel. It enables CSPs to quickly respond to changes in the marketplace a seasonal change in demand, for example and bundle offerings that will appeal to a particular customer, across a particular channel, at a particular time. Mobilize Process: It mobilizes a set of automations that allows customers enjoy the personalized engaging journey in real time that spans outbound and inbound communications, sales, orders, service and help intervention, and fulfil customer s next immediate demand. The 3M framework needs to be based on an event-driven architecture (EDA) incorporating Enterprise Service Bus (ESB) and Business Process Management (BPM) and should be application and technology agnostic. It needs to interact with multiple channels using events; match patterns of a set of events with pre-defined policies, rules, and analytical models; deliver a set of automations to fulfil personalized experience that spans the complete customer lifecycle. Furthermore, the 3M framework needs to be supported with key high-level functional components, which include: Customer Intelligence from Big data: A typical implementation of customer intelligence from Big data is the combination of Data Warehouse and real time customer intelligence analytics. It requires aggregation of customer and product data from CSP s various data sources in BSS/OSS, leveraging CSP s existing investments with data models, workflows, decision tables, user interface, etc. It also integrates with the key modules in CSP s enterprise landscape, covering: Customer Management: A complete customer relationship management solution combines a 360 degree view of the customer with intelligent guidance and seamless back-office integration to increase first contact resolution and operational efficiency. Offer Management: CSP-specific specialization and re-use capabilities that define new services, products, 30
33 bundles, fulfilment processes and dependencies and rapidly capitalize on new market opportunities and improve customer experience. Order Management: The configurable best practices for creating and maintaining holistic order journey that is critical to the success of such product-intensive functions as account opening, quote generation, ordering, contract generation, product fulfilment and service delivery. Service Management: Case based work automation and a complete view of each case enables an effective management of every case throughout its lifecycle. Event Driven Process Automation: A dynamic process automation engine empowered with EDA leverages the context of the interaction to orchestrate the flow of activities, guiding customer service representatives (CSRs) and selfservice customers through every step in their inbound and outbound interactions, in particular for Campaign Management and Retention Management. Campaign Management: Outbound interactions are typically used to target products and services to particular customer segments based on analysis of customer data through appropriate channels. It uncovers relevant, timely and actionable consumer and network insights to enable intelligently driven marketing campaigns to develop, define and refine marketing messages and target customer with a more effective planand meet customers at the touch points of their choosing through optimized display and search results while generating demand via automated creation, delivery and results tracking. Retention Management: Customers offer their attention, either intrusively or nonintrusively to look for the products and services that meet their needs through the channel of their choices. It dynamically captures consumer data from highly active and relevant outlets such as social media, websites and other social sources and enables CSPs to quickly respond to customer needs and proactively deliver relevant offers for upgrades and product bundles that take into account each customer s personal preference. Experience Personalization: It provides the customer with personalized, relevant experience, enabled from business process automation that connects people, processes and systems in real time and eliminates product, process and channel silos. It helps CSPs extend predictive targeting beyond basic cross-sells to automate more of their cross-channel strategies and gain valuable insights from hidden, consuming and interaction patterns. 31
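A minimal sketch of the Monitor Pattern idea described above follows: interaction events arriving from multiple channels are matched against pre-defined policies, and an action is triggered when a pattern completes. The event fields, policies and actions are assumptions made for illustration; a production implementation would sit on an EDA/CEP engine with time-windowed rules rather than simple counters.

```python
# Illustrative event-pattern monitor: match incoming customer interaction
# events against pre-defined policies and trigger a next action.
from collections import defaultdict

POLICIES = [
    {   # repeated complaints -> retention offer (assumed rule)
        "name": "churn_risk",
        "event_type": "complaint",
        "threshold": 3,
        "action": "send_retention_offer",
    },
    {   # repeated visits to an upgrade page -> upsell recommendation
        "name": "upgrade_interest",
        "event_type": "web_visit_upgrade_page",
        "threshold": 2,
        "action": "recommend_upgrade_bundle",
    },
]

class PatternMonitor:
    def __init__(self, policies):
        self.policies = policies
        self.counts = defaultdict(int)   # (customer_id, event_type) -> count

    def on_event(self, event):
        """Consume one event; return actions whose pattern just matched."""
        key = (event["customer_id"], event["type"])
        self.counts[key] += 1
        actions = []
        for policy in self.policies:
            if (event["type"] == policy["event_type"]
                    and self.counts[key] == policy["threshold"]):
                actions.append((policy["action"], event["customer_id"]))
        return actions

monitor = PatternMonitor(POLICIES)
stream = [
    {"customer_id": "C42", "type": "complaint", "channel": "call_centre"},
    {"customer_id": "C42", "type": "complaint", "channel": "social"},
    {"customer_id": "C42", "type": "complaint", "channel": "web"},
]
for ev in stream:
    for action, cust in monitor.on_event(ev):
        print(f"trigger {action} for {cust}")
```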
34 Overall, the 3M framework will empower BDV solution for CSP to execute on the real-time decision that aligns individual needs with business objectives and dynamically fulfils the next best action or offer that will increase the value of each personalized interaction. BDV IN ACTION- CUSTOMER EXPERIENCE OPTIMIZATION By implementing the proposed BDV solution, CSPs can optimize customer experience that delivers the right interaction with each customer at right time so as to build strong relationships, reduce churn, and increase customer value to the business. From Customer Experience Perspective: It provides CSP with real-time, endto end visibility into all the customer interaction events taking place across multi-channels, by correlating and analyzing these events, using a set of business rules, and automatically takes proactive actions which ultimately lead to customer experience optimization. It helps CSP turn their multi-channel contacts with customers into cohesive, integrated interaction patterns, allowing them to better segment their customers and ultimately to take full advantage of that segmentation, deliver personalized experiences that are dynamically tailored to each customer while dramatically improving interaction effectiveness and efficiency. From CSPs Perspective: It helps CSPs quickly weed out underperforming campaigns and learn more about their customers and their needs. From retail store to contact centre to Web to social media, it helps CSPs deliver a new standard of branded, consistent customer experiences that build deeper, more profitable and lasting relationships. It enables CSPs to maximize productivity by handling customer interactions as fast as possible in the most profitable channel. At every point in the customer lifecycle, from marketing campaigns, offer and order to servicing and retention efforts, BDV helps to inform its interactions with that customer s preferences, the context of her relationship with the business, and actual and potential value, enables CSPs focus on creating personalized experiences that balance the customer s needs with business values. Campaign Management: BDV delivers focused campaigns on the customer with predictive modelling and cost-effective campaign automation that consistently distinguishes the brand and supports personalized communications with prospects and customers. Offer Management: BDV dynamically generates offers that account for such factors as the current interaction with the customer, the individual s total value across product lines, past interactions, and likelihood of defecting. It helps deliver optimal value and increases the effectiveness of propositions with nextbest-action recommendations tailored to the individual customer. Order Management: BDV enables the unified process automation applicable to multiple product lines, with agile and 32
35 flexible workflow, rules and process orchestration that accounts for the individual needs in product pricing, configuration, processing, payment scheduling and delivery. Service Management: BDV empowers customer service representatives to act based on the unique needs and behaviours of each customer using realtime intelligence combined with holistic customer content and context. Retention Management: BDV helps CSPs retain more high-value customers with targeted next-best-action dialogues. It consistently turns customer interactions into sales opportunities by automatically prompting customer service representatives to proactively deliver relevant offers to satisfy each customer s unique need. CONCLUSION Today s increasingly sophisticated digital consumers expect CSPs to deliver product, service and interaction experience designed just for me at this moment. To take on the challenge, CSPs need to deliver customer experience optimization powered by BDV in real time. By implementing an overarching 3M BDV framework to execute a holistic 5C process new products can be brought to market with faster velocity and with the ability to easily adapt common services to accommodate unique customer and channel needs. Suffice it to say that BDV will enable CSP to deliver customer-focused experience that matches responses to specific individual demands; provide real time intelligent guidance that streamlines complex interactions; and automate interactions from end-to-end. The result is an optimized customer experience that helps CSPs substantially increase customer satisfaction, retention and profitability, and consequently empowers CSPs evolving into the experience centric Tomorrow s Enterprise. REFERENCES 1. IBM Big data solutions deliver insight and relevance for digital media Solution Brief- June 2012 available at www-05. ibm.com/fr/events/netezzadm.../ Solutions_Big_Data.pdf. 2. Oracle Big data Premier-Presentation (May 2012). Available at premiere.digitalmedianet.com/articles/ viewarticle.jsp?id= SAP HANA for Next-Generation Business Applications and Real-Time Analytics (July 2012). Available at SAS High-Performance Analytics (June 2012). Available at com/reg/gen/uk/hpa?gclid=cjkpvv CJiLQCFbMbtAodpj4Aaw. 5. Transform the Customer Experience with Pega-CRM (2012). Available at files/private/transform-customer- Experience-with-Pega-CRM-WP- Apr2012.pdf. 6. The Forrester Wave : Enterprise Hadoop Solutions for Big data-feb Available at edu/aim/uploads/infotec2012/ HANDOUTS/KEY_ / Infotec2012BigDataPresentationFinal. pdf. 7. Shah S. (2012), Top 5 Reasons Communications Service Providers 33
Need Operational Intelligence. Available at Top-5-Reasons-Communications-Service-Providers-Need-Operational-Intelligence.
8. Connolly, S. and Wooledge, S. (2012), Harnessing the Value of Big Data Analytics. Available at wc-0217-harnessing-value-bigdata/.
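The next-best-action idea that runs through the offer and retention scenarios above can be illustrated with a toy scoring routine. The candidate actions, propensity figures and weights below are invented for illustration; they stand in for what would normally be a trained propensity model and a business-rules layer.

```python
# Illustrative next-best-action scoring; all numbers are assumed.
CANDIDATE_ACTIONS = [
    {"name": "upgrade_to_4g_bundle", "margin": 12.0},
    {"name": "loyalty_discount_10pct", "margin": 4.0},
    {"name": "no_action", "margin": 0.0},
]

def acceptance_probability(action, customer):
    """Toy propensity model: in practice this would be a trained model."""
    base = {"upgrade_to_4g_bundle": 0.15,
            "loyalty_discount_10pct": 0.40,
            "no_action": 1.0}[action["name"]]
    if customer["churn_risk"] > 0.5 and "discount" in action["name"]:
        base += 0.25          # at-risk customers respond better to discounts
    return min(base, 1.0)

def next_best_action(customer):
    scored = []
    for a in CANDIDATE_ACTIONS:
        expected_value = acceptance_probability(a, customer) * a["margin"]
        if a["name"] == "no_action":
            # assumed cost of inaction for customers likely to churn
            expected_value -= customer["churn_risk"] * 5.0
        scored.append((expected_value, a["name"]))
    return max(scored)[1]

customer = {"id": "C42", "churn_risk": 0.7, "lifetime_value": 640.0}
print(next_best_action(customer))   # -> loyalty_discount_10pct
```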
37 Infosys Labs Briefings VOL 11 NO Liquidity Risk Management and Big Data: A New Challenge for Banks By Abhishek Kumar Sinha Implement a Big Data framework and manage your liquidity risk better During the 2008 financial crisis, banks faced an enormous challenge of managing liquidity and remaining solvent. As many financial institutions failed, those who survived the crisis have fully understood the importance of liquidity risk management. Managing liquidity risk on simple spreadsheets can lead to non-real-time and inappropriate information that may not be enough for efficient liquidity risk management (LRM). Banks must have reliable data on daily positions and other liquidity measures that have to be monitored continuously. During signs of stress, like changes in liquidity of various asset classes and unfavorable market conditions, banks need to react to these changes in order to remain credible in the market. In banking liquidity risk and reputation is so heavily linked to the extent that even a single liquidity event can lead to catastrophic funding problems for a bank. MISMANAGEMENT OF LIQUIDITY RISK: SOME EXAMPLES OF FAILURES Northern Rock was a star performer UK bank until the 2007 crisis struck. The source of funding was mostly wholesale funding and capital market funding. Hence in the 2008 crisis, when these funding avenues dried up across the globe, it was unable to fund its operations. During the crisis, the bank s stock fell 32% along with depositors run on the bank. The central bank had to intervene and support the bank in the form of deposit protection and money market operations. Later the Government took the ultimate step of nationalizing the bank. Lehman Brothers had 600 billion in assets before its eventual collapse. The bank s stress testing omitted its riskiest asset -- the commercial real estate portfolio, which in turn led to misleading stress test results. The liquidity of the bank was very low compared to the balance sheet size and the risks it had taken. The bank had used deposits with clearing banks as assets in its liquidity buffer which was not in compliance with the regulatory guidelines. The bank lost 73% in share price during the first half of 2008, and filed for bankruptcy in September
38 2008 financial crisis has shown that the current liquidity risk management (LRM) approach is highly unreliable in a changing and difficult macroeconomic atmosphere. The need of the hour is to improve operational liquidity management on a priority basis. THE CURRENT LRM APPROACH AND ITS PAIN POINTS Compliance/Regulation Across global regulators, LRM principles have become stricter and complex in nature. The regulatory focus is mainly on areas like risk governance, measurement, monitoring and disclosure. Hence, the biggest challenge for the financial institutions worldwide is to react to these regulatory measures in an appropriate and timely manner. Current systems are not equipped enough to handle these changes. For example, LRM protocols for stress testing and contingency funding planning (CFP) focus more on the inputs to the scenario analysis and new stress testing scenarios. These complex inputs need to be very clearly selected and hence it poses a great challenge for the financial institution. Siloed Approach to Data Management Many banks use a spreadsheet-based LRM approach that gets data from different sources which are neither uniform nor comparable. This leads to a great amount of risk in manual processes and data quality issues. In such a scenario, it becomes impossible to collate enterprise wide liquidity position and the risk remains undetectable. Lack of Robust LRM Infrastructure There is a clear lack of a robust system which can incorporate real-time data and generate necessary actions in time. The various liquidity parameters can be changing funding costs, counterparty risks, balance sheet obligations, and quality of liquidity in capital markets. THE NEED OF A READY-MADE SOLUTION In a recent Swift survey, 91% respondents indicated that there is a lack of ready-made liquidity risk analytics and business intelligence applications to complement risk integration processes. Since we can see that the regulation around the globe in form of Basel III, Solvency II, CRD IV, etc., are shaping up hence there is an opportunity to standardize the liquidity reporting process. A solution that can do this can be of great help to banks as it would save them both effort and time, as well as increase the efficiency of reporting. Banks can focus solely on the more complex aspects like inputs to the stress testing process and on business and strategy to control liquidity risk. Even though there can be differences in approach of various banks in managing liquidity, these changes can be incorporated in the solution as per the requirements. CHALLENGES/SCOPE OF REQUIREMENTS FOR LRM The scope of requirements for LRM ranges from concentration analysis of liquidity exposures, calculation of average daily peak of liquidity usage, historical and future view of liquidity flows on both contractual and behavioral in nature, collateral management, stress testing and scenario analysis, generate regulatory reports, liquidity gap across buckets, contingency fund planning, net interest income analysis, fund transfer pricing, to capital allocation. All these liquidity measures are monitored and alerts generated in case of thresholds breached. 36
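As one concrete example of the regulatory ratios and threshold alerts mentioned above, the sketch below computes simplified Basel III measures: the liquidity coverage ratio (high-quality liquid assets over net cash outflows for the next 30 days, with inflows capped at 75% of outflows) and the net stable funding ratio (available over required stable funding), flagging breaches of an assumed 100% floor. Real calculations apply detailed run-off rates, haircuts and stable-funding factors per exposure class; the sample figures here are illustrative only.

```python
# Simplified Basel III liquidity ratio monitoring; sample figures are assumed.

def liquidity_coverage_ratio(hqla, cash_outflows_30d, cash_inflows_30d):
    # Inflows that may offset outflows are capped at 75% of outflows.
    net_outflows = cash_outflows_30d - min(cash_inflows_30d,
                                           0.75 * cash_outflows_30d)
    return hqla / net_outflows

def net_stable_funding_ratio(available_stable_funding, required_stable_funding):
    return available_stable_funding / required_stable_funding

def check_thresholds(metrics, thresholds):
    """Return alert messages for every metric that breaches its floor."""
    return [f"ALERT: {name} = {value:.2f} below floor {thresholds[name]:.2f}"
            for name, value in metrics.items() if value < thresholds[name]]

positions = {
    "LCR": liquidity_coverage_ratio(hqla=180.0, cash_outflows_30d=240.0,
                                    cash_inflows_30d=90.0),
    "NSFR": net_stable_funding_ratio(available_stable_funding=520.0,
                                     required_stable_funding=560.0),
}
for alert in check_thresholds(positions, {"LCR": 1.0, "NSFR": 1.0}):
    print(alert)
```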
39 Concentration analysis of liquidity exposures shows some important points on whether the assets or liabilities of the institution are dependent on a certain customer, or a product like asset or mortgage backed securities. It also tries to see if the concentration is region wise country wise, or by any other parameter that can be used to detect a concentration for the overall funding and liquidity situation. Regulatory liquidity reports have Basel III liquidity ratios like liquidity coverage ratio (LCR), net stable funding ratio (NSFR), FSA and Fed 4G guidelines, early warning indicators, funding concentration, liquidity asset/ collateral, and stress testing analysis. Timely completion of these reports in the prescribed format is important for financial institutions to remain complaint with the norms. Calculation of average daily peak of liquidity usage gives a fair idea of the maximum intraday liquidity demand and the firm can keep necessary steps to manage the liquidity in ideal way. The idea is to detect patterns and in times of high, low or medium liquidity scenarios utilize the available liquidity buffer in the most optimized way. Collateral management is very important as the need for collateral and its value after applying the required haircuts has to be monitored on a daily basis. In case of unfavorable margin calls the amount of collateral needs to be adjusted to avoid default in various outstanding positions. Stress testing and scenario analysis is like a self-evaluation for the banks, in which they need to see how bad things can go in case of high stress events. Internal stress testing is very important to see the amount of loss in case of unfavorable events. For the systematically important institutions, regulators have devised some stress scenarios based on the past crisis events. These scenarios need to be given as an input to the stress tests and the results have to be given to the regulators. A proper stress testing ensures that the institution is aware of what risk it is taking and what can be the consequences of the same. Net interest income analysis (NIIA), FTP and capital allocation are performance indicators for an institution that raises money from deposits or other avenues and lends it to customers, or performs an investment to achieve a rate of return. The NII is the difference between the cost of funds to the interest rate achieved by lending or investing the same. The implementation of FTP links the liquidity risk/ market risk to the performance management of the business units. The NII analysis helps in predicting the future state of the P/L statement and balance sheet of the bank. Contingency fund planning contains of wholesale, retail and other funding reports in areas of both secured and unsecured funds, so that in case of these funding avenues drying up banks can look for other alternatives. It states the reserve funding avenues like use of credit lines, repro transactions, unsecured loans, etc., that can be accessed timely and at a reasonable cost in liquidity crisis situation. Intra-group borrowing and lending reports show the liquidity position across group companies. Derivatives reports related to market value, collateral and cash flows are very important to an efficient derivatives portfolio management. Bucket-wise and cumulative liquidity gap under business as usual and stress 37
40 Take Corrective Measures scenario situations give a fair idea of varying liquidity across time buckets. Both contractual and behavioral cash flows are tracked to get the final inflow and outflow scenario. This is done over different time periods, like 30 days to 3 years to get a long term as well as short term view of liquidity. Historic cash flows are tracked as they help in modeling the future behavioral cash flows. Historical assumptions plus current market scenarios are very important in dynamic analysis of behavioral cash flows. Other important reports are related to available pool of unencumbered assets and non-marketable assets. Corporate Governance Strategic Level Planning Identify & Assess Liquidity Risk Monitor & Report PeriodicAnalysis for Possible Gaps Figure 1: Iterative Framework for effective liquidity risk management Source: Infosys Research All the scoped requirements can only be satisfied when the firm has a framework in place to take necessary decisions related to liquidity risk. Hence, next we would have a look into a LRM framework and as well as a data governance framework for managing liquidity risk data. LRM FRAMEWORK Separate group for LRM that is a constituted of members from the asset liability committee, risk committee and top management needs to be formed. This group must function independent of the other groups in the firm and must have the autonomy to take liquidity decisions. Strategic level planning helps in defining the liquidity risk policy in a clear manner related to the overall business strategy of the firm. The risk appetite of the firm needs to be mentioned in measurable terms and the same has to be communicated to all the stakeholders in the firm. Liquidity risks across the business need to be identified and the key risk indicators and metrics are to be decided. Risk indicators are to be monitored on a regular basis, so that in the case of an upcoming stress scenario preemptive steps can be taken. Monitoring and reporting is to be done for internal control as well as for the regulatory compliance. Finally there has to be a periodic analysis of the whole system in order to identify possible gaps in it and the frequency of review has to be at least once in a year and in case of extreme markets scenarios more frequently. To satisfy the scoped out requirements we can see that the data from various sources is used to form liquidity data warehouse and datamart which acts as an input to the analytical engines. The engines contain business rules and logic based on which the key liquidity parameters are calculated. All the analysis is presented in report and dashboards form for both regulatory compliance and internal risk management as well as for decision making purposes. Some Uses of Big data Application in LRM 1. Staging Area Creation for Data Warehouse: Big data application can store huge volumes of data and perform some analysis on it along with aggregating data for further analysis. Due to its fast processing for large amount of data it can be used as loader to 38
load data into the data warehouse along with facilitating the extract-transform-load (ETL) processes.
2. Preliminary Data Analysis: Data can be moved in from various sources and a visual analytics tool can then be used to create a picture of what data is available and how it can be used.
3. Making Full Enterprise Data Available for High-Performance Analytics: Analytics at large firms were often limited to a sample set of records on which the analytical engines would run and provide certain results; but as a Big data application provides distributed parallel processing capacity, the limitation on the number of records is non-existent now. Billions of records can now be processed at increasingly amazing speeds.

Figure 2: LRM data governance framework for Analytics and BI with Big data capabilities (data sources such as market data, reference data, systems of record for collateral, deposits, loans, securities and product/LOB, the general ledger and external data are loaded into a Big data application that acts as the staging layer and operational data store with data quality checks; ETL then feeds the data warehouse and datamarts, with general ledger reconciliation, into analytical engines for asset liability management, fund transfer pricing and liquidity risk and capital calculation, producing regulatory reports such as Basel ratios NSFR and LCR, FED 4G, FSA reports, stress testing and regulatory capital allocation, as well as internal reports covering net interest income analysis, ALM, FTP and liquidity costs, funding concentration, liquid assets, capital allocation and planning, internal stress tests and key risk indicators) Source: Infosys Research

HOW BIG DATA CAN HELP IN LRM ANALYTICS AND BI

Operational efficiency and swiftness is an area where high-performance analytics can help achieve faster decision making, because all the required analysis is obtained much faster. Liquidity risk is a killer in today's financial world and is most difficult to track, as large banks have diverse instruments and a large number of scenarios need to be analyzed, like changes in interest rates, exchange rates, liquidity and depth in the markets
42 worldwide, and for such dynamic analysis Big data analytics is a must. Stress testing and scenario analysis, both require intensive computing as lot of data is involved hence faster scenario analysis means quick action in case of stressed market conditions. With Big data capabilities scenarios that would takes hours to otherwise run can now be run in minutes and hence aid in quick decision making and action. Efficient product pricing can be achieved by implementing real time fund transfer pricing system and profitability calculations. This ensures the best possible pricing of market risks along with adjustments like liquidity premium across the business units. CONCLUSION The LRM system is the key for a financial institution to survive in competitive and highly unpredictable financial markets. The whole idea of managing liquidity risk is to know the truth, and be ready for the worst market scenarios. This predictability is what is needed, and can save a bank in times like the 2008 crisis. Even at the business level a proper LRM system can help in better product pricing using FTP, and hence pricing can be logical and transparent. Traditionally data has been a headache for banks and is seen more as compliance and regulation requirement, but going forward there are going to be even more stringent regulations and reporting standards across the globe. After the crisis of 2008 new Basel III liquidity reporting standards, newer scenarios for stress testing have been issued that requires extensive data analysis and can only be timely possible with Big data applications. All in the banking industry know that the future is uncertain and high margins will always be a challenge, so an efficient data management along with Big data capabilities needs to be in place. This will add value to the banks profile by clear focus on the new opportunities for banks and bring predictability to their overall businesses. Successful banks in future would be the ones who take LRM initiatives seriously and implement the system successfully. Banks with an efficient LRM system would definitely build a strong brand and reputation in the eyes of investors, customers, and regulators around the world. REFERENCES 1. Banking on Analytics: How High- Performance Analytics Tackle Big data Challenges in Banking (2012), SAS white paper. Available at resources/whitepaper/wp_42594.pdf. 2. New regime, rules and requirements welcome to the new liquidity, Basel lll: implementing liquidity requirements, ERNST & YOUNG (2011). 3. Leveraging Technology to Shape the future of Liquidity Risk Management, Sybase Aite. Group study, July, Managing liquidity risk, Collaborative solutions to improve position management and analytics (2011), SWIFT white paper. 5. Principles for Sound Liquidity Risk Management and Supervision, BIS Document, (2008). 6. Technology Economics: The Cost of Data, Howard Rubin, Wall Street and Technology Website, Available at
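The bucket-wise and cumulative liquidity gap report described earlier can be sketched as below. The time buckets, contractual cash flows and the behavioural adjustment factor are sample assumptions; an actual implementation would source these from the liquidity data warehouse and behavioural models.

```python
# Minimal sketch of a bucket-wise and cumulative liquidity gap report.
BUCKETS = ["0-30d", "31-90d", "91-365d", "1-3y"]

# Contractual inflows/outflows per bucket (sample figures, in millions).
contractual = {
    "0-30d":   {"inflows": 120.0, "outflows": 150.0},
    "31-90d":  {"inflows": 200.0, "outflows": 180.0},
    "91-365d": {"inflows": 340.0, "outflows": 300.0},
    "1-3y":    {"inflows": 500.0, "outflows": 420.0},
}

# Assumed behavioural adjustment: e.g., only 80% of contractually callable
# outflows are expected to actually materialize in the first bucket.
BEHAVIOURAL_OUTFLOW_FACTOR = {"0-30d": 0.8}

def liquidity_gap_report(flows):
    cumulative = 0.0
    rows = []
    for bucket in BUCKETS:
        outflow = flows[bucket]["outflows"] * \
            BEHAVIOURAL_OUTFLOW_FACTOR.get(bucket, 1.0)
        gap = flows[bucket]["inflows"] - outflow
        cumulative += gap
        rows.append((bucket, round(gap, 1), round(cumulative, 1)))
    return rows

for bucket, gap, cum in liquidity_gap_report(contractual):
    print(f"{bucket:>8}: gap {gap:>8} cumulative {cum:>8}")
```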
43 Infosys Labs Briefings VOL 11 NO Big Data Medical Engine in the Cloud (BDMEiC): Your New Health Doctor By Anil Radhakrishnan and Kiran Kalmadi Diagnose, customize and administer health care on real time using BDMEiC Imagine a world, where the day to day data about an individual s health is tracked, transmitted, stored, analyzed on a real-time basis. Worldwide diseases are diagnosed at an early stage without the need to visit a doctor. And lastly a world, where every individual will have a life certificate that contains all their health information, updated on a real time basis. This is the world, to which Big data can lead us to. Given the amount of data generated for e.g.,, body vitals, blood samples, etc., every day in the human body, it s a haven for generating Big data. Analyzing this Big data in healthcare is of prime importance. Big data analytics can play a significant role in the early detection/advanced diagnosis of such fatal diseases that which can reduce health care cost and improve quality. Hospitals, medical universities, researchers, insurers will be positively impacted on applying analytics on this Big data. However, the principal beneficiaries of analyzing this Big data will be the Government, patients and therapeutic companies. RAMPANT HEALTHCARE COSTS A look at the healthcare expenditure of countries like US and UK, would automatically explain the burden that healthcare is on the economy. As per data released by Centers for Medicare and Medicaid Services, health expenditure in the US is estimated to have reached $2.7 trillion or over $8,000 per person [1]. By 2020, this is expected to balloon to $4.5 trillion [2]. These costs will have a huge bearing on an economy that is struggling to get up on its feet, having just come out of a recession. According to the Office for National Statistics in the UK, healthcare expenditure in UK amounted to billion in 2010; from billion in 2009 [3]. With rising healthcare cost, countries like Spain have already pledged to save 7 Billion by slashing health spending, while also charging more for drugs [5]. Middle income earners will now have to pay more for drugs. This increase in healthcare costs is not isolated to a few countries alone. According to World Health Organization statistics released 41
44 in 2011, per capita total expenditure on health jumped from US$ 566 to US$ 899 from 2000 to 2008, an alarming increase of 58% [4]. This huge increase is testimony to the fact that far from increasing steadily, healthcare costs have been increasing exponentially. While healthcare costs have been increasing, the data generated through body vitals, lab reports, prescriptions, etc. has also been increasing significantly. Analysis of this data will lead to better and advanced diagnosis, early detection and more effective drugs which in turn will result in significant reduction in healthcare costs. HOW BIG DATA ANALYTICS CAN HELP REDUCE HEALTHCARE COSTS? Analysis of Big data that is generated from various real time patient records possesses a lot of potential for creating quality healthcare at reduced costs. Real time refers to data like body temperature, blood pressure, pulse/ heart rate, and respiratory rate that can be generated every 2-3 minutes. This data collected across individuals provides the volume of data at a high velocity, while also providing the required variety since it is obtained across geographies. The analysis of this data can help in reducing costs by enabling real time diagnosis, analysis and medication, which offers Improved insights into drug effectiveness Insights for early detection of diseases Improved insights into origins of various diseases Insights to create personalized drugs. These insights that Big data analytics provides are unparalleled and go a long way in reducing the cost of healthcare. USING BIG DATA ANALYTICS FOR PERSONALIZING DRUGS The patents of many high profile drugs are ending by Hence, therapeutic companies need to examine the response of patients to these drugs to help create personalized drugs. Personalized drugs are those that are tailored according to an individual patient. Real time data collected from various patients will help generate Big data, the analysis of which will help identify how individual patients, reacted to the drugs administered to them. By this analysis, therapeutic companies will be able to create personalized drugs custom-made to an individual. A personalized drug is one of the important solutions that Big data analytics will have the power to offer. Imagine a situation where, analytics will help determine the exact amount and type of medicine that an individual would require, even without them having to visit a doctor. That s the direction in which Big data analytics in healthcare has to move. In addition, the analytics of this data can also significantly reduce healthcare costs that run into billions of dollars every year. BIG DATA ANALYTICS FOR REAL TIME DIAGNOSIS USING BIG DATA MEDICAL ENGINE IN THE CLOUD (BDMEIC) Big data analytics for real time diagnosis are characterized by real time Big data analytics systems. These systems contain a closed loop feedback system, where insights from the application of the solution serve as feedback for further analysis. (Refer Figure 1). Access to real time data provides a quick way to accumulate and create Big data. The closed loop feedback system is important because it helps the system in building its intelligence. These systems can not only help 42
to monitor patients in real time but can also be used to provide diagnosis, detect early and deliver medication in real time. This can be achieved through a Big data Medical Engine in the Cloud (BDMEiC) [Fig. 2]. This solution would consist of two medical patches (arm and thigh), an analytics engine, a smartphone and a data center.

Figure 1: Real Time Big Data Analytics System (real time medical data feeds the analysis of real time data; a new solution is created based on that analysis, and newer insights from the solution are fed back into the system) Source: Infosys Research

As depicted above, the BDMEiC solution consists of the following:

1. Arm and Thigh Based Electronic Medical Patch: An arm based electronic medical patch (these patches are thin, lightweight, elastic and have embedded sensors) that can monitor the patient is strapped to the arm of an individual. It reads vitals like body temperature, blood pressure, pulse/heart rate and respiratory rate to monitor brain, heart, muscle activity, etc. The patch then transmits this real time data to the individual's smartphone, which is synced with the patch. The extraction of the data happens at regular intervals (every 2-3 minutes). The smartphone transmits the real time data to the data center in the medical engine. The thigh based electronic medical patch is used for providing medication. The patch comes with a drug cartridge (pre-loaded drugs) that can be inserted into a slot in the patch. When it receives data from the smartphone, the device can provide the required medication to the patient through auto-injectors that are a part of the drug cartridge.

2. Data Center: The data center is the Big data cloud storage that receives real time data from the medical patch and stores it. This data center will be a repository of real time data received across different individuals across geographies. This data is then transmitted to the Big data analytics engine.

3. Big Data Analytics Engine: The Big data analytics engine performs three major functions - analyzing data, sharing analyzed data with organizations and transmitting medication instructions back to the smartphone.

Analyzing Data: It analyzes the data (like body temperature, blood pressure, pulse/heart rate, respiratory rate, etc.) received from the data center using its inbuilt medical intelligence, across individuals. As the system keeps analyzing this data it also keeps building on its intelligence.
46 2 Medical Engine Data Analytics Center Engine Organizations Medical Labs Medical Universities Medical Research Centers Therapeutic Companies Real time Medication With the analytics engine, monitoring patient data in real time, the diagnosis and treatment of patients in real time is possible. With the data being shared with top research facilities and medical institutions in the world, the diagnosis and treatment would be more effective and accurate. Figure 2: Big Data Medical Engine in the Cloud (BDMEiC) Source: Infosys Research Sharing Analyzed Data: The analytics engine also transmits its analysis to various universities, medical centers, therapeutic companies and other related organizations for further research. Transmitting Medication Instructions: The analytics engine also can transmit medication instructions to an individual s smartphone, which in turn transmits data to the thigh patch, whenever medication has to be provided. The BDMEiC solution can act as a real time doctor that diagnoses, analyzes, and provides personalized medication to individuals. Such a solution that harnesses the potential of Big data provides manifold benefits to various beneficiaries. BENEFITS AND BENEFICIARIES OF BDMEIC The BDMEiC solution if adopted in a large scale manner can offer a multitude of benefits, few of which are listed below. Specific Instances: Blood pressure data can be monitored real time and stored in the data center. The analysis of this data by the analytics engine can keep the patients as well as doctor updated real time, if the blood pressure moves beyond permissible limits. Beneficiaries: Patients, medical institutions and research facilities. Convenience The BDMEiC solution offers convenience to patients, who would not always be in a position to visit a doctor. Specific Instances: Body vitals can be measured and analyzed with the patient being at home. This especially helps in the case of senior citizens and busy executives who can now be diagnosed and treated right at home or while on the move. Beneficiaries: Patients. Insights into drug effectiveness The system allows doctors, researchers and therapeutic companies to understand the impact of their drugs in real time. This helps them to create better drugs in the future. Specific Instances: The patents of many high profile drugs are ending by Therapeutic companies can use BDMEiC to perform real 44
47 time Big data analysis, to understand their existing drugs better, so that they can create better drugs in the future. Beneficiaries: Doctors, researchers and therapeutic companies Beneficiaries: Patients and doctors Reduced Costs Real time data collected from BDMEiC assists in the early detection of diseases, thereby reducing the cost of treatment. Early Detection of Diseases As BDMEiC monitors, stores, and analyzes data in real time, it allows medical researchers, doctors and medical labs to detect diseases at an early stage. This allows them to provide an early cure. Specific Instances: Early detection of cancer and other life threatening diseases can lead to lesser spending on healthcare. Beneficiaries: Government and patients. Specific Instances: Early detection of diseases like cancer, childhood pneumonia, etc., using BDMEiC can help provide medication at an early stage thereby increasing the survival rate. Beneficiaries: Researchers, medical Labs and patients. Improved Insights into Origins of Various Diseases With BDMEiC storing and analyzing real time data, researchers get to know the cause and symptoms of a disease much better and at an early stage. Specific Instances: Newer strains of viruses can be monitored and researched in real time. Beneficiaries: Researchers and medical labs. Insights to Create Personalized Drugs Real time data collected from BDMEiC will help doctors administer the right dose of drugs to the patients. Specific Instances: Instead of a standard pill, patients can be given the right amount of drugs, customized according to their needs. CONCLUSION The present state of the healthcare system leaves a lot to be desired. Healthcare costs are spiraling and forecasts suggest that they are not poised to come down any time soon. In such a situation, organizations world over, including governments should look to harness the potential of real time Big data analytics to provide high quality and cost effective healthcare. vthe solution proposed in this paper, tries to utilize this potential to bridge the gap between medical research, and the final delivery of the medicine. REFERENCES 1. US Food and Drug Administration, National Health Expenditure Projections (January 2012), Centers for Medicare & Medicaid Services, Office of the Actuary. Available at cms.gov/research-statistics-data- and-systems/statistics-trends-and- Reports/NationalHealthExpendData/ Downloads/Proj2011PDF.pdf. 3. Jurd, A. (2012), Expenditure on healthcare in the UK , Office for National Statistics. Available at gov.uk/ons/dcp171766_ pdf. 45
4. World Health Statistics 2011, World Health Organization. Available at whostat/en_whs2011_full.pdf.
5. The Ministry of Health, Social Policy and Equality, Spain. Available at publicaciones/comic/docs/pilladaingles.pdf.
Infosys Labs Briefings VOL 11 NO
Big Data Powered Extreme Content Hub
By Sudheeshchandran Narayanan and Ajay Sadhu
Taming the Big Content explosion and providing contextual and relevant information is the need of the day
Content is getting bigger by the minute and smarter by the second [5]. As content grows in size and becomes more varied in structure, discovery of valuable and relevant content becomes a challenge. Existing enterprise content management (ECM) products are limited in scalability, the variety of content they can handle, rigid schemas, and limited indexing and processing capability. Content enrichment is often an external activity and not often deployed. The content manager is more like a content repository and is used primarily for search and retrieval of published content. Existing content management solutions can handle only a few data formats and provide very limited capability with respect to content discovery and enrichment.
With the arrival of Big Content, the need to extract, enrich, organize and manage semi-structured and unstructured content and media is increasing. As the next generation of users will rely heavily on new modes of interacting with content, e.g., mobile devices and tablets, there is a need to relook at traditional content management strategies. Artificial intelligence will now play a key role in information retrieval, information classification and usage for these sophisticated users. To facilitate the use of artificial intelligence on this Big Content, knowledge about entities, domains, etc., needs to be captured, processed, reused, and interpreted by the computer. This has resulted in the formal specification and capture of the structure of a domain, called an ontology; the classification of entities within the domain into predefined categories, called a taxonomy; and the inter-relating of these entities to create the semantic web (web of data). The new breed of content management solutions needs to bring in elastic indexing, distributed content storage and low latency to address these changes. But the story does not end there. The ease of deploying
technologies like natural language text analytics and machine learning now takes this new breed of content management to the next level of maturity.
Time is of the essence for everyone today. Contextual filtering of content based on relevance is an immediate need. There is a need to organize content, create new taxonomies, and create new links and relationships beyond what is specified. The next generation of content management solutions should leverage ontologies, the semantic web and linked data to derive the context of the content and enrich the content metadata with this context. Then, leveraging this context, the system should provide real-time alerts as the content arrives. In this paper, we discuss the details of the extreme content hub and its implementation semantics, technology viewpoint and use cases.
THE BIG CONTENT PROBLEM IN TODAY'S ENTERPRISES
Legacy Content Management Systems (CMS) focused on addressing the fundamental problems in content management, i.e., content organization, indexing, and searching. With the evolution of the internet, these CMS added Content Publishing Lifecycle Management (CPLM) and workflow capabilities to the overall offering. The focus of these ECM products was on providing a solution for enterprise customers to easily store and retrieve various documents and a simplified search interface. Some of these solutions evolved to address the web publishing problem. These existing content management solutions have constantly shown performance and scalability concerns. Enterprises have invested in high end servers and hired performance engineering experts to address this. But will this last long?
Figure 1: Augmented Capabilities of Extreme Content Hub Manager (core features such as indexing, search, workflow, metadata repository and content versioning, augmented with heterogeneous content ingestion, automated content discovery, content enrichment, a highly available elastic and scalable system, and unified intelligent content access and insights). Source: Infosys Research
With the arrival of Big data (volume, variety and velocity), these problems have amplified further and the need for next generation content management capabilities has grown. Requirements and demand have gone beyond the storing, searching and indexing of traditional documents. Enterprises need to store a wide variety of content ranging from documents, videos, social media feeds, blog posts, podcasts, images, etc. Extraction, enrichment, organization and management of semi-structured, unstructured and multi-structured content and media are a big challenge today. Enterprises are under tremendous competitive pressure to derive meaningful insights from these piles of information assets and derive business value from this Big data. Enterprises are looking for contextual and relevant information at lightning speed. The ECM solution must address all of the above technical and business requirements.
EXTREME CONTENT HUB: KEY CAPABILITIES
The key capabilities required for the Extreme Content Hub (ECH), apart from the traditional indexing, storage and search capabilities, can be classified in the following five dimensions (Figure 2):
Heterogeneous Content Ingestion, which provides input adapters to bring a wide variety of content (documents, videos, images, blogs, feeds, etc.) into the content hub seamlessly. The next generation of content management systems needs to support real-time content ingestion for RSS feeds, news feeds, etc., and support streams of events as a key ingestion capability.
Automated Content Discovery, which extracts the metadata and classifies the incoming content seamlessly against pre-defined ontologies and taxonomies.
A Scalable, Fault-tolerant Elastic System that can seamlessly expand to the demands of volume, velocity and variety growth of the content.
Content Enrichment services, which leverage machine learning and text analytics technologies to enrich the context of the incoming content.
Unified Intelligent Content Access, which provides a set of content access services that are context aware and based on information relevance through user modeling and personalization.
To realize the ECH, there is a need to augment the existing search and indexing technologies with the next generation of machine learning and text analytics to bring in a cohesive platform. The existing content management solution still provides quite a good list of features that cannot be ignored.
BIG DATA TECHNOLOGIES: RELEVANCE FOR THE CONTENT HUB
With the advent of Big data, the technology landscape has made a significant shift. Distributed computing has now become a key enabler for large scale data processing, and with open source contributions this has received a significant boost in recent years. The year 2012 was the year of large scale Big data technology adoption. The other significant advancement has been in NoSQL (Not Only SQL) technology, which complements existing RDBMS systems for scalability and flexibility. Scalable near real-time access provided by these systems has boosted the adoption of distributed
computing for real-time data storage and indexing needs. Scalable and elastic deployments, enabled by advances in private and public clouds, have accelerated the adoption of distributed computing in enterprises. Overall, there is a significant change from our earlier approach of solving the ever-increasing data and performance problem by throwing more hardware at it. Today, deploying a scalable distributed computing infrastructure that addresses the velocity, variety and volume problem, and does so as a cost-effective alternative using open source technologies, provides the business case for building the ECH. The solution to the problem is to augment the existing content management solution with the processing capabilities of Big data technologies to create a comprehensive platform that brings in the best of both worlds.
Figure 2: Extreme Content Hub (heterogeneous feeds such as existing enterprise content, social feeds, enterprise log feeds and real-time news, alerts and RSS feeds flow through ingestion, metadata extraction, classification and machine learning services into a Hadoop distributed file system with HBase index and link storage, and are exposed through unified content, search, alert and API services alongside the existing enterprise CM). Source: Reference [12]
REALIZATION OF THE ECH
The ECH requires a scalable, fault tolerant, elastic system that provides scalability of storage, compute and network infrastructure. Distributed processing technologies like Hadoop provide the foundation platform for this. A private cloud based deployment model will provide the on-demand elasticity and scale required to set up such a platform. A metadata model driven ingestion framework could ingest a wide variety of feeds into the hub seamlessly. Content ingestion could apply content security tagging during the ingestion process to ensure that the content stored inside the hub is secured and authorized before access. NoSQL technologies like HBase and MongoDB could provide the scalable metadata repository needed by the system.
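As a rough illustration of the metadata model driven ingestion described above, the following sketch in plain Python validates incoming items against a hypothetical metadata model, applies a security tag at ingestion time and stores a metadata record in an in-memory dictionary that stands in for the HBase or MongoDB repository. The content types and field names are illustrative assumptions, not part of the proposed platform.

    import hashlib
    import json
    import time

    # Hypothetical metadata model: required fields per content type (illustrative only).
    METADATA_MODEL = {
        "document": ["title", "author", "mime_type"],
        "rss_item": ["title", "source_url", "published"],
        "video": ["title", "duration_seconds", "codec"],
    }

    # In-memory stand-in for the scalable metadata repository (HBase or MongoDB in the paper).
    metadata_repository = {}

    def ingest(content_type, metadata, raw_bytes, security_tag="internal"):
        """Validate metadata against the model, tag the content and store a metadata record."""
        required = METADATA_MODEL.get(content_type)
        if required is None:
            raise ValueError("unknown content type: %s" % content_type)
        missing = [field for field in required if field not in metadata]
        if missing:
            raise ValueError("missing metadata fields: %s" % ", ".join(missing))
        content_id = hashlib.sha1(raw_bytes).hexdigest()  # stable id, also usable for de-duplication
        metadata_repository[content_id] = {
            "content_type": content_type,
            "security_tag": security_tag,   # content security tagging applied at ingestion time
            "ingested_at": time.time(),
            "metadata": metadata,
        }
        return content_id

    # Usage example with a hypothetical RSS item
    item_id = ingest("rss_item",
                     {"title": "Market update", "source_url": "http://example.com/feed", "published": "2013-01-15"},
                     b"<raw feed entry bytes>")
    print(json.dumps(metadata_repository[item_id], indent=2))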
Search and indexing technologies have matured to the next level since the advent of the Web, and deploying a scalable indexing service like Solr, Elasticsearch, etc., provides the much-needed scalable indexing and search capability for the platform. Deploying machine learning algorithms leveraging Mahout and R on this platform can bring in auto-discovery of content metadata and auto-classification for content enrichment. De-duplication and other value-added services can be seamlessly deployed as batch frameworks on the Hadoop infrastructure to bring value-added context to the content. Machine learning and text analytics technologies can be further leveraged to provide recommendation and contextualization of user interactions through unified context aware services.
BENEFITS OF ECH
The ECH is at the center of enterprise knowledge management and innovation. Serving contextual and relevant information to users will be one of the fundamental usages of the ECH. Auto-indexing will help discover multiple facets of the content and help in discovering new patterns and relationships between the various entities that would have gone unnoticed in the legacy world. The integrated metadata view of the content will help in building a 360 degree view of a particular domain or entity from the various sources.
The ECH could enable discovery of user tastes and likings based on the content searched and viewed. This could serve real-time recommendations to users through content hub services and could help the enterprise in user behavior modeling. Emerging trends in various domains can be discovered as content gets ingested into the hub. The ECH could extend into an analytics platform for video and text analytics. Real-time information discovery can be facilitated using pre-defined alerts and rules that are triggered as new content arrives in the hub. The derived metadata and context could be pushed to the existing content management solution to preserve the benefits of and investments in existing products and platforms, while augmenting the processing and analytics capabilities with new technologies. The ECH will now be able to handle large volumes and a wide variety of content formats and bring in deep insights leveraging the power of machine learning. These solutions will be very cost effective and will also leverage existing investment in the current CMS.
CONCLUSION
The need is to take a platform-centric approach to this Big Content problem rather than a standalone content management solution, to look at it strategically and to adopt a scalable architecture platform to address it. However, such an initiative does not need to replace the existing content management solutions; rather, it should augment their capabilities to fill in the required white spaces. The approach discussed in this paper provides one such implementation of the augmented content hub leveraging current advancements in Big data technologies. Such an approach will provide the enterprise with a competitive edge in the years to come.
REFERENCES
1. Agichtein, E., Brill, E. and Dumais, S. (2006), Improving web search ranking by incorporating user behavior. Available at um/people/sdumais/.
2. Dumais, S. (2011), Temporal Dynamics
and Information Retrieval. Available at um/people/sdumais/.
3. Reamy, T. (2012), Taxonomy and Enterprise Content Management. Available at com/presentations.shtml.
4. Reamy, T. (2012), Enterprise Content Categorization: How to Successfully Choose, Develop and Implement a Semantic Strategy, kapsgroup.com/presentations/ContentCategorization-Development.pdf.
5. Barroca, E. (2012), Big data's Big Challenges for Content Management, TechNewsWorld. Available at html.
Infosys Labs Briefings VOL 11 NO
Complex Events Processing: Unburdening Big Data Complexities
By Bill Peer, Prakash Rajbhoj and Narayanan Chathanur
Analyze, crunch and detect unforeseen conditions in real time through CEP of Big Data
A study by The Economist revealed that 1.27 zettabytes was the amount of information in existence in 2010 as household data [1]. The Wall Street Journal reported Big data as the new boss in all key sectors such as education, retail and finance. But on the other side, an average Fortune 500 enterprise is estimated to have around 10 years' worth of customer data, with more than two-thirds of it unusable. How can enterprises make such an explosion of data usable and relevant? Overall there are not trillions but quadrillions of data items to analyze, the volume is expected to increase exponentially, and this evidently impacts businesses worldwide. Additionally, the problem is one of providing speedier results, and results are expected to get slower with more data to analyze unless technologies innovate at the same pace.
Any function or business, whether it is road traffic control, high frequency trading, auto adjudication of insurance claims or controlling the supply chain logistics of electronics manufacturing, requires huge data sets to be analyzed as well as timely processing and decision making. Any delay, even in seconds or milliseconds, affects the outcome. Significantly, technology should be capable of interpreting historical patterns, applying them to current situations and taking accurate decisions with minimal human interference.
Big data is about the strategy to deal with vast chunks of incomprehensible data sets. There is now awareness across industries that traditional data stores and processing approaches like databases, files, mainframes or even mundane caching cannot be used as a solution for Big data. Still, the existing models do not address the capabilities of processing and analyzing data, integrating with events and performing real-time analytics, all in split-second intervals. On the other hand, Complex Event Processing (CEP) has evolved to provide solutions that utilize in-memory data grids for analyzing trends, patterns and events in real time, with assessments in a matter of milliseconds. However, Event Clouds, a byproduct of using
CEP techniques, can be further leveraged to monitor for unforeseen conditions arising, or even the emergence of an unknown-unknown, creating early awareness and a potential first mover advantage for the savvy organization.
To set the context of the paper, we first highlight how CEP with in-memory data grid technologies helps in pattern detection, matching, analysis, processing and decision making in split seconds on Big data. This model should serve any industry function where time is of the essence, Big data is at the core and CEP acts as the mantle. Later, we propose treating an Event Cloud as more than just an event collection bucket used for event pattern matching, or as simply the immediate memory store of an exo-cortex for machine learning; an Event Cloud is also a robust corpus with its own intrinsic characteristics that can be measured, quantified, and leveraged for advantage. For example, by automating the detection of a shift away from an Event Cloud's steady state, the emergence of a previously unconsidered situation may be observed. It is this application, programmatically discerning the shift away from an Event Cloud's normative state, which is explored in this paper.
CEP AS REAL TIME MODEL FOR BIG DATA: SOME RELEVANT CASES
In current times, traffic updates are integrated with cities' traffic control systems as well as with many global positioning service (GPS) electronic receivers used quite commonly by drivers. These receivers automatically adjust and reroute in case the normal route is traffic-ridden. This helps, but the solution is reactionary. Many technology companies are investing in pursuit of the holy grail: a solution that detects and predicts traffic blockages and takes proactive action to control the traffic itself and even avoid mishaps. For this there is a need to analyze traffic data over different parameters such as rush hour, accidents, seasonal impacts of snow, thunderstorms, etc., and come up with predictable patterns over years and decades. The second part is the application of these patterns to input conditions. All this requires huge data crunching and analyses and, on top of it, real-time application such as CEP.
Big data has already gained importance in the financial markets, particularly in high frequency trading. Since the 2008 economic downturn and its rippling effects on the stock market, the volume of trade has come down at all the top exchanges such as New York, London, Singapore, Hong Kong and Mumbai. But the contrasting factor is the rise in High Frequency Trading (HFT). It is claimed that around 70% of all equity trades were accounted for by HFT in 2010, versus 10% in earlier years. HFT is 100% dependent on technology and the trading strategies are developed out of complex algorithms. Only those trades backed by a better strategy and more data crunched in less time will have a better win ratio. This is where CEP could be useful.
The healthcare industry in the USA is set to undergo a rapid change with the Affordable Care Act. Healthcare insurers are expected to see an increase in their costs due to the increased risk of covering more individuals, and they legally cannot deny insurance for pre-existing conditions. Hospitals are expected to see more patient data, which means increased analyses, and pharmaceutical companies need better integration with insurers and consumers to have speedier and more accurate settlements. Even though most of these transactions can be performed on a non-real-time basis, technology still needs both Big data and complex processing for a scalable solution.
In India, the outstanding cases in various judicial courts touch 32 million. In the USA, family-based cases and immigration-related ones
are piling up waiting for a hearing. Judicial pendency has left no country untouched. Scanning through various federal, state and local law points, past rulings, class suits, individual profiles, evidence details, etc., is required to put forward the cases for the parties involved, and the winner is the one who is able to present a better analysis of the available facts. Can technology help in addressing such problems across nations?
All of these cases across such diverse industries showcase the importance of processing gigantic amounts of data and also the need to have the relevant information churned out at the right time.
WHY AND WHERE BIG DATA
Big data has evolved due to the limitations of current technologies. Two-tier or multi-tier architecture, even with a high performing database at one end, is not enough to analyze and crunch such colossal information in the desired time frames. The fastest databases today are benchmarked at terabytes of information, as noted by the Transaction Processing Council; volumes of exabytes and zettabytes of data need a different technology. Analysis of unstructured data is another criterion for the evolution of Big data. Information available as part of health records, geo maps and multimedia (audio, video and pictures) is essential for many businesses, and mining such unstructured sets requires storage power as well as transaction processing power. Add to this the variety of sources such as social media, legacy systems, vendor systems, localized data, and mechanical and sensor data. Finally, there is the critical component of speed, to take the data through the steps of Unstructured → Structured → Storage → Mine → Analyze → Process → Crunch → Customize → Present.
BIG DATA METHODOLOGIES: SOME EXAMPLES
The Apache Hadoop project [2] and its relatives such as Avro, ZooKeeper, Cassandra and Pig provided the non-database form of technology as the way to solve problems with massive data. It uses a distributed architecture as the foundation to remove the constraints of traditional constructs. Both data (storage, transportation) and processing (analysis, conversion, formatting) are distributed in this architecture. Figure 1 and Figure 2 compare the traditional and distributed architectures.
Figure 1: Conventional Multi-Tier Architecture (server tier, middle tier, client tier). Source: Infosys Research
Figure 2: Distributed Multi-Nodal Architecture (data nodes and processing nodes for validation, enrichment, transformation, standardization, routing and operation). Source: Infosys Research
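To make the distributed map/reduce split concrete, the sketch below shows a minimal event-count mapper and reducer written in Python in the Hadoop Streaming style (records read from standard input, tab-separated key/value pairs written to standard output). The input record layout and the command line in the comment are illustrative assumptions rather than a prescribed design.

    import sys
    from itertools import groupby

    def mapper(lines):
        # Emit one "<event_type>\t1" pair per record; assumes a comma-separated log line
        # whose third field is the event type (an illustrative layout, not a fixed format).
        for line in lines:
            fields = line.rstrip("\n").split(",")
            if len(fields) >= 3:
                print("%s\t1" % fields[2])

    def reducer(lines):
        # Hadoop Streaming sorts mapper output by key, so equal keys arrive adjacent to each other.
        pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
        for key, group in groupby(pairs, key=lambda kv: kv[0]):
            print("%s\t%d" % (key, sum(int(count) for _, count in group)))

    if __name__ == "__main__":
        # Example invocation (hypothetical paths):
        #   hadoop jar hadoop-streaming.jar -input /logs -output /counts \
        #          -mapper "python count.py map" -reducer "python count.py reduce" -file count.py
        (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)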
A key advantage of distributed architecture is scalability. Nodes can be added without affecting the design of the underlying data structures and processing units. IBM has even gone a step further with Watson [5], the famous artificially intelligent computer, which can learn as it gets more information and patterns for decision making. Similarly, IBM [6], Oracle [7], Teradata [8] and many other leading software providers have created Big data methodologies as an impetus to help enterprise information management.
VELOCITY PROBLEM IN BIG DATA
Even though we clearly see the benefits of Big data, and its architecture can easily be applied to any industry, there are some limitations that are not easily perceivable. A few pointers:
Can Big data help a trader get the best win scenarios based on millions and even billions of computations over multiple trading parameters in real time?
Can Big data forecast traffic scenarios based on sensor data, vehicle data, seasonal change and major public events, and provide alternate paths to drivers through their GPS devices in real time, helping both city officials and drivers to save time?
Can Big data detect fraud scenarios by running through multiple shopping patterns of a user in historical data and matching them against the current transaction in real time?
Can Big data provide real-time analytical solutions out of the box and support predictive analytics?
There are multiple business scenarios in which data has to be analyzed in real time. These data are created, updated and transferred because of real-time business or system-level events. Since the data is in the form of real-time events, this requires a paradigm shift in the way data is viewed and analyzed. Real-time data analysis in such cases means that the data has to be analyzed before it hits the disk; the difference between event and data just vanishes. In such cases across industries, Big data is unequivocally needed to manage the data, but to use this data effectively, integrate it with real-time events and provide the business with express results, a complementary technology is required, and that is where CEP fits in.
VELOCITY PROBLEM: CEP AS A SOLUTION
The need here is the analysis of data arriving in the form of real-time event streams and the identification of patterns or trends based on vast historical data. Adding to the complexity are other real-time events. The vastness is solved with Big data; the real-time analysis of multiple events, pattern detection and appropriate matching and crunching are solved by CEP. Real-time event analysis avoids duplicates and synchronization issues, as the data is still in flight and storage is still a step away. Similarly, it facilitates predictive analysis of data by means of pattern matching and trending. This enables the enterprise to provide early warning signals and take corrective measures in real time. The reference architecture of traditional CEP is shown in Figure 3. CEP's original objective was to provide processing capability similar to Big data with
distributed architecture and in-memory grid computing. The difference was that CEP was to handle multiple, seemingly unrelated events and correlate them to provide a desired and meaningful output. The backbone of CEP, though, can be traditional architectures such as multi-tier technologies, with CEP usually in the middle tier.
Figure 3: Complex Events Processing - Reference Architecture (event generation and capture, event modeling and management, the event processing engine with pre-filtering, pattern matching, aggregation and correlation, and monitoring and administration tools). Source: Infosys Research
Figure 4 shows how CEP on Big data solves the velocity problem and complements the overall information management strategy for any enterprise that aims to use Big data. CEP can utilize Big data, particularly highly scalable in-memory data grids, to store the raw feeds, events of interest and detected events, and analyze this data in real time by correlating it with other in-flight events. Fraud detection is a very apt example: historic data of the customer's transactions, usage profile, location, etc., is stored in the in-memory data grid, and every new event (transaction) from the customer is analyzed by the CEP engine by correlating and applying patterns on the event data against the historic data stored in the memory grid. There are multiple scenarios, some of them outlined through this paper, where CEP complements Big data and other offline analytical approaches to accomplish an active and dynamic event analytics solution.
Figure 4: CEP on Big Data (the event processing engine backed by an in-memory DB or data grid and a Big data store, with dashboard, query agent and write connector). Source: Infosys Research
EVENT CLOUDS AND DETECTION TECHNIQUES
CEP and Event Clouds
A linearly ordered sequence of events is called an event stream [9]. An event stream may contain many different types of events, but there must be some aspect of the events in the event stream that allows for a specific ordering. This is typically an ordering via timestamp.
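As a simplified illustration of the kind of rule a CEP engine evaluates over such an ordered event stream, the plain-Python sketch below flags a card that is swiped at a gas station repeatedly within a ten-minute window. The event fields and the threshold are illustrative assumptions; a real deployment would express this pattern in the engine's own language and correlate it with historic data held in the in-memory data grid.

    from collections import defaultdict, deque

    WINDOW_SECONDS = 600   # ten-minute window
    THRESHOLD = 3          # illustrative: alert on the third swipe inside the window

    recent_swipes = defaultdict(deque)   # card_id -> timestamps of recent gas-station swipes

    def on_event(event):
        """Process one event from a timestamp-ordered stream; return an alert dict or None."""
        if event["type"] != "card_swipe" or event["merchant_category"] != "gas_station":
            return None
        window = recent_swipes[event["card_id"]]
        window.append(event["ts"])
        while window and event["ts"] - window[0] > WINDOW_SECONDS:
            window.popleft()   # expire swipes that fell out of the ten-minute window
        if len(window) >= THRESHOLD:
            return {"alert": "possible_fraud", "card_id": event["card_id"], "swipes": len(window)}
        return None

    # Usage example over a toy stream (field names are hypothetical)
    stream = [
        {"type": "card_swipe", "card_id": "C1", "merchant_category": "gas_station", "ts": 0},
        {"type": "card_swipe", "card_id": "C1", "merchant_category": "gas_station", "ts": 120},
        {"type": "card_swipe", "card_id": "C1", "merchant_category": "gas_station", "ts": 400},
    ]
    for event in stream:
        alert = on_event(event)
        if alert:
            print(alert)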
By watching for event patterns of interest in an event stream, such as multiple usages of the same credit card at a gas station within a 10-minute window, systems can respond with predefined business-driven behaviors, such as placing a fraud alert on the suspect credit card.
An Event Cloud is a partially ordered set of events (POSET), either bounded or unbounded, where the partial orders are imposed by the causal, timing and other relationships between events [10]. As such, it is a collection of events within which the ordering of events may not be possible. Further, there may or may not be an affinity of the events within a given Event Cloud. If there is an affinity, it may be as broad as all events of interest to our company or as specific as all events from the emitters located at the back of the building. Event Clouds and event streams may contain events from sources outside of an organization, such as stock market trades or tweets from a particular Twitter user. Event Clouds and event streams may have business events, operational events, or both. Strictly speaking, an event stream is an Event Cloud, but an Event Cloud may or may not be an event stream, as dictated by the ordering requirement.
Typically, a landscape with CEP capabilities will include three logical units: (i) emitters that serve as sources of events, (ii) a CEP engine, and (iii) targets to be notified under certain event conditions. Sources can be anything from an application to a sensor to even the CEP engine itself. CEP engines, which are the heart of the system, are implemented in one of two fundamental ways. Some follow the rules-based paradigm, matching on explicitly stated event patterns using algorithms like Rete, while other CEP engines use the more sophisticated event analytics approach, looking
for probabilities of event patterns emerging using techniques like Bayesian classifiers [11]. In either case, rules or analytics, some consideration of what is of interest must be identified up front. Targets can be anything from dashboards to applications to the CEP engine itself.
Users of the system, using the tools provided by the CEP provider, articulate events and patterns of events that they are interested in exploring, observing, and/or responding to. For example, a business user may indicate to the system that for every sequence wherein a customer asks about a product three times but does not invoke an action that results in a buy, the system is then to provide some promotional material to the customer in real time. As another example, a technical operations department may issue event queries to the CEP engine, in real time, asking about the number of server instances being brought online and the probability that there may be a deficit in persistence storage to support the servers.
Focusing on events, while extraordinarily powerful, biases what can be cognized. That is, what you can think of, you can explore. What you can think of, you can respond to. However, by adding the Event Cloud, or event stream, to the pool of elements being observed, emergent patterns not previously considered can be brought to light. This is the crux of this paper: using the Event Cloud as a porthole into unconsidered situations emerging.
EVENT CLOUDS HAVE FORM
As represented in Figure 5, there is a point wherein events flowing through a CEP engine are unprocessed. This point is an Event Cloud, which may or may not be physically located within a CEP engine memory space. This Event Cloud has events entering its logical space and leaving it. The only bias to the events travelling through the CEP engine's Event Cloud is based on which event sources are serving as inputs to the particular CEP engine. For environments wherein all events, regardless of source, are sent to a common CEP engine, there is no bias of events within the Event Cloud.
Figure 5: CEP Engine Components (input adapters, event ingress bus, the Event Cloud, filter, union, correlate, match and apply-rules stages, output bus and output adapters). Source: Infosys Research
There are a number of attributes about the Event Cloud that can be captured, depending upon a particular CEP's implementation. For example, if an Event Cloud is managed
in memory and is based on a time window, e.g., events of interest only stay within consideration by the engine for a period of time, then the number of events contained within an Event Cloud can be counted. If the structure holding an Event Cloud expands and contracts with the events it is funneling, then the memory footprint of the Event Cloud can be measured. In addition to the number of events and the memory size of the containing unit, the counts of the event types themselves that happen to be present at a particular time within the Event Cloud become a measurable characteristic. These properties, viz., memory size, event counts, and event types, can serve as measurable characteristics describing an Event Cloud, giving it a size and shape (Figure 6).
Figure 6: Event Cloud (the events traversing an Event Cloud at any particular moment give it shape and size). Source: Infosys Research
EVENT CLOUD STEADY STATE
The properties of an Event Cloud that give it form can be used to measure its state. By collecting its state over time, a normative operating behavior can be identified and its steady state can be determined. This steady state is critical when watching for unpredicted patterns. When a new flow pattern of events causes an Event Cloud's shape to shift away from its steady state, a situation change has occurred (Figure 7).
Figure 7: Event Cloud Shift (shape shifts as new patterns occur). Source: Infosys Research
When these steady state deviations happen, and if no new matching patterns or rules are being invoked, then an unknown-unknown may have emerged. That is, something significant enough to adjust your system's operating characteristics has occurred yet isn't being acknowledged in some way. Either it has been predicted but determined not to be important, or it was simply not considered.
ANOMALY DETECTION APPLIED TO EVENT CLOUD STEADY STATE SHIFTS
Finding patterns in data that do not match a baseline pattern is the realm of anomaly detection. As such, by using the steady state of an Event Cloud as the baseline, we can apply anomaly detection techniques to discern a shift. Table 1 presents a catalog of various anomaly detection techniques that are applicable to Event Cloud shift discernment. This list isn't meant to serve as an exhaustive compilation, but rather to showcase the variety of possibilities. Each algorithm has its own set of strengths
such as simplicity, speed of computation, and certainty scores. Each algorithm, likewise, has weaknesses, including computational demands, blind spots in data deviations, and difficulty in establishing a baseline for comparison. All of these factors must be considered when selecting an appropriate algorithm.
Table 1: Applicability of Anomaly Detection Techniques to Event Cloud Steady State Shifts. Source: Derived from Anomaly Detection: A Survey [12]
- Classification based (example constituent techniques: neural networks, Bayesian networks, support vector machines, rule-based). Challenge: accurately labeled training data for the classifiers is difficult to obtain.
- Nearest neighbour based (distance to kth nearest neighbour, relative density). Challenge: defining meaningful distance measures is difficult.
- Clustering based.
- Statistical (parametric, non-parametric). Challenge: histogram approaches miss unique combinations.
- Spectral (low variance PCA, eigenspace-based). Challenge: high computational complexity.
Using the three properties defined for an Event Cloud's shape (e.g., event counts, event types, and Event Cloud size) combined with time properties, we have a multivariate data instance with three continuous dimensions, viz., counts, sizes, and time, and one categorical dimension, viz., types. These four dimensions, and their characteristics, become a constraint on which anomaly detection algorithms can be applied [13]. The anomaly type being detected is also a constraint. In this case, the Event Cloud deviations are classified as a collective anomaly, as opposed to a point anomaly or a context anomaly, since we are comparing a collection of data instances that form the Event Cloud shape with a broader set of all data instances that formed the Event Cloud steady state shape.
Statistical algorithms lend themselves well to anomaly detection when analyzing continuous and categorical data instances. Further, knowing an Event Cloud's steady state shape a priori isn't assumed, so the use of a nonparametric statistical model is appropriate [13]. Therefore, the technique of statistical profiling using histograms is explored as an example implementation approach for catching a steady state shift. One basic approach to trap the moment of an Event Cloud's steady state shift is to leverage a histogram based on each event type, with the number of times a particular count of an event type shows up in a given Event Cloud instance becoming the basis for comparison. The histogram generated over time would then serve as the baseline steady state picture of normative behavior. Individual instances of an Event Cloud's shape could then be compared to the Event Cloud's steady state histogram to discern if a deviation has occurred. That is, does the particular Event Cloud instance contain counts of events that have rarely, or never, appeared in the Event Cloud's history?
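A minimal sketch of this histogram-profiling idea is given below in plain Python. The event types, the baseline counts and the rarity-based scoring rule are illustrative assumptions; they simply show how a per-type score can be derived from the steady state histogram and aggregated.

    from collections import Counter

    # Baseline histograms: for each event type, how often each per-instance count was seen
    # historically, e.g. "Ask" appeared three-at-a-time in exactly one past Event Cloud instance.
    # The numbers below are illustrative.
    baseline = {
        "Ask": Counter({1: 40, 2: 12, 3: 1}),
        "Buy": Counter({0: 30, 1: 20, 2: 3}),
        "Look": Counter({1: 25, 2: 18, 3: 10}),
    }

    def anomaly_score(instance_counts):
        """Average per-type rarity; a count never seen before for a type scores 1.0."""
        scores = []
        for event_type, count in instance_counts.items():
            history = baseline.get(event_type, Counter())
            observations = sum(history.values())
            seen = history.get(count, 0)
            scores.append(1.0 - (float(seen) / observations if observations else 0.0))
        return sum(scores) / len(scores) if scores else 0.0

    # Six Ask events in a single instance has never occurred, so the aggregate score is high.
    snapshot = {"Ask": 6, "Buy": 1, "Look": 2}
    print(round(anomaly_score(snapshot), 3))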
Figure 8 represents the case with a steady state histogram on the left and the Event Cloud comparison instance on the right. In this depiction the histogram shows, as an example, that three Ask Events were contained within an Event Cloud instance exactly once in the history of this Event Cloud. The Event Cloud instance, on the right, that will be compared shows that the instance has six Ask Events in its snapshot state.
Figure 8: Event Cloud Histogram & Comparison. Source: Infosys Research
An anomaly score for each event type is calculated by comparing each Event Cloud instance event type count to the event type quantity occurrence bins within the Event Cloud steady state histogram, and then these individual scores are combined into an aggregate score [13]. This aggregate score then becomes the basis upon which a judgment is made regarding whether a deviation has occurred or not. While simple to implement, the primary weakness of the histogram based approach is that a rare combination of events in an Event Cloud would not be detected if the quantities of the individual events present were in their normal or frequent quantities.
LIMITATIONS OF EVENT CLOUD SHIFTS
Anomaly detection algorithms have blind spots, or situations where they cannot discern an Event Cloud shift. This implies that it is possible for an Event Cloud to shift undetected, under just the right circumstances. However, following the lead suggested by Okamoto and Ishida with immunity-based anomaly detection systems [13], rather than having a single observer detecting when an Event Cloud deviates from steady state, a system could have multiple observers, each with their own techniques and approaches applied. Their individual results could then be aggregated, with varying weights applied to each technique, to render a composite Event Cloud steady state shift score. This will help reduce the chances of missing a state change shift.
With the approach outlined by this paper, the scope of indicators is such that you get an early indication that something new is emerging, and nothing more. Noticing an Event Cloud shift only indicates that a situational change has occurred; it does not identify or highlight what the root cause of the change is, nor does it fully explain what is happening. Analysis is still required to determine what initiated the shift along with what opportunities for exploitation may be present.
FURTHER RESEARCH
Many enterprise CEP implementations are architected in layers, wherein event abstraction hierarchies, event pattern maps and event processing networks are used in concert to increase the visibility aspects of the system [14] as well as to help with overall performance by allowing for the segmenting of event flows. In general, each layer going up the hierarchy is an aggregation of multiple events from its immediate child layer. With the lowest layer containing the finest grained events and the highest layer containing the coarsest grained events, the Event Clouds that manifest at each layer are likewise of varying granularity (Figure 9). Therefore a noted Event Cloud steady state shift at the lowest layer represents the finest granularity shift that can be observed. An Event Cloud's steady state shifts at the highest layer represent the coarsest steady
state shifts that can be observed.
Figure 9: Event Hierarchies (CEP in layers). Source: Infosys Research
Techniques for interleaving individual layer Event Cloud steady state shifts, along with the opportunities and consequences of their mixed granularity, can be explored.
The technique presented in this paper is designed to capture the beginnings of a situational change not explicitly coded for. With the recognition of a new situation emerging, the immediate task is to discern what is happening and why, while it is unfolding. Further research can be done to discern which elements available from the automated steady state shift analysis would be of value to help an analyst, business or technical, unravel the genesis of the situation change. By discovering what change information is of value, not only can an automated alert be sent to interested parties, but it can contain helpful clues on where to start their analysis.
CONCLUSION
It would be an understatement that without the right set of systems, methodologies, controls, checks and balances on data, no enterprise can survive. Big data solves the problem of the vastness and multiplicity of the ever rising information in this information age. What Big data does not fulfill is the complexity associated with real-time data analysis. CEP, though designed purely for events, complements the Big data strategy of any enterprise. The Event Cloud, a constituent component of CEP, can be used for more than its typical application. By treating it as a first class citizen of indicators, and not just a collection point computing construct, a company can gain insight into the early emergence of something new, something previously not considered and potentially the birthing of an unknown-unknown. With organizations growing in their usage of Big data, and the desire to move closer to real-time response, companies will inevitably leverage the CEP paradigm. The question will be: do they use it as everyone else does, triggering off of conceived patterns, or will they exploit it for unforeseen situation emergence? When the situation changes, the capability is present and the data is present, but are you?
REFERENCES
1. WSJ article on Big data. Available at html.
2. Transaction Processing Council Benchmark comparison of leading databases. Available at org/tpcc/results/tpcc_perf_results.asp.
3. Transaction Processing Council Benchmark comparison of leading databases. Available at org/tpcc/results/tpcc_perf_results.asp.
4. Apache Hadoop project site. Available at
5. IBM Watson Artificial intelligent super computer's Home Page. Available at us/watson/.
6. IBM's Big data initiative. Available at data/bigdata/.
7. Oracle's Big data initiative. Available at technologies/big-data/index.html.
8. Teradata Big data Analytics offerings. Available at business-needs/big-data-analytics/.
9. Luckham, D. and Schulte, R. (2011), Event Processing Glossary Version 2.0, Compiled. Available at complexevents.com/2011/08/23/eventprocessing-glossary-version-2-0/.
10. Bass, T. (2007), What is Complex Event Processing? TIBCO Software Inc.
11. Bass, T. (2010), Orwellian Event Processing. Available at thecepblog.com/2010/02/28/orwellianevent-processing/.
12. Chandola, V., Banerjee, A. and Kumar, V. (2009), Anomaly Detection: A Survey, ACM Computing Surveys.
13. Okamoto, T. and Ishida, Y. (2009), An Immunity-Based Anomaly Detection System with Sensor Agents, Sensors, ISSN
14. Luckham, D. (2002), The Power of Events: An Introduction to Complex Event Processing in Distributed Enterprise Systems, Addison Wesley, Boston.
15. Vincent, P. (2011), ACM Overview of BI Technology misleads on CEP. Available at com/2011/07/28/acm-overview-of-bitechnology-misleads-on-cep/.
16. About Esper and NEsper FAQ, esper.codehaus.org/tutorials/faq_esper/faq.html#what-algorithms.
17. Ide, T. and Kashima, H. (2004), Eigenspace-based Anomaly Detection in Computer Systems, Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August, pp.
Infosys Labs Briefings VOL 11 NO
Big Data: Testing Approach to Overcome Quality Challenges
By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja
Validate data quality by employing a structured testing technique
Testing Big data is one of the biggest challenges faced by organizations because of a lack of knowledge about what to test and how much data to test. Organizations have been facing challenges in defining test strategies for structured and unstructured data validation, setting up an optimal test environment, working with non-relational databases and performing non-functional testing. These challenges result in poor quality of data in production, delayed implementation and increased cost. A robust testing approach needs to be defined for validating structured and unstructured data, and testing should start early to identify possible defects early in the implementation life cycle and to reduce the overall cost and time to market.
Different testing types like functional and non-functional testing are required, along with strong test data and test environment management, to ensure that the data from varied sources is processed error-free and is of good enough quality to perform analysis on. Functional testing activities like validation of the map-reduce process, structured and unstructured data validation, and data storage validation are important to ensure that the data is correct and of good quality. Apart from these functional validations, non-functional testing like performance and failover testing plays a key role in ensuring the whole process is scalable and happens within the specified SLAs.
A Big data implementation deals with writing complex Pig and Hive programs and running these jobs using the Hadoop map-reduce framework on huge volumes of data across different nodes. Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers. Hadoop uses Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. Hadoop utilizes its own distributed file system, HDFS, which makes data available to multiple computing nodes. Figure 1 shows the step-by-step process of how Big data is processed using the Hadoop ecosystem. The first step, loading source data into
HDFS, involves extracting the data from different source systems and loading it into HDFS. Data is extracted using crawl jobs for web data and tools like Sqoop for transactional data, and is then loaded into HDFS by splitting it into multiple files. Once this step is completed, the second step, performing map reduce operations, involves processing the input files and applying map and reduce operations to get the desired output. The last step, extracting the output results from HDFS, involves extracting the data output generated by the second step and loading it into downstream systems, which can be an enterprise data warehouse for generating analytical reports or any of the transactional systems for further processing.
Figure 1: Big Data Testing Focus Areas (1. loading source data files into HDFS; 2. performing map reduce operations; 3. extracting the output results from HDFS). Source: Infosys Research
BIG DATA TESTING APPROACH
As we are dealing with huge data volumes and executing on multiple nodes, there is a high chance of bad data and data quality issues at each stage of the process. Functional data testing is performed to identify data issues caused by coding errors or node configuration errors. Testing should be performed at each of the three phases of Big data processing to ensure that data is getting processed without any errors. Functional testing includes (i) validation of pre-Hadoop processing; (ii) validation of the Hadoop map reduce process data output; and (iii) validation of the data extract and load into the EDW. Apart from these functional validations, non-functional testing including performance testing and failover testing needs to be performed. Figure 2 shows a typical Big data architecture diagram and highlights the areas where testing should be focused.
Validation of Pre-Hadoop Processing
Data from various sources like weblogs, social network sites, call logs, transactional data, etc., is extracted based on the requirements and loaded into HDFS before processing it further.
Issues: Some of the issues which we face during this phase of the data moving from source
systems to Hadoop are incorrect data captured from source systems, incorrect storage of data, and incomplete or incorrect replication.
Figure 2: Big Data architecture, highlighting the testing focus areas (pre-Hadoop process validation, map-reduce process validation, ETL process validation, reports testing, and non-functional testing such as performance and failover testing). Source: Infosys Research
Validations: Some high level scenarios that need to be validated during this phase include:
1. Comparing the input data files against source systems data to ensure the data is extracted correctly
2. Validating the data requirements and ensuring the right data is extracted
3. Validating that the files are loaded into HDFS correctly, and
4. Validating that the input files are split, moved and replicated in different data nodes.
Validation of Hadoop Map Reduce Process
Once the data is loaded into HDFS, the Hadoop map-reduce process is run to process the data coming from different sources.
Issues: Some issues that we face during this phase of the data processing are coding issues in map-reduce jobs, jobs working correctly when run on a standalone node but incorrectly when run on multiple nodes, incorrect aggregations, node configurations, and incorrect output format.
Validations: Some high level scenarios that need to be validated during this phase include:
1. Validating that data processing is completed and the output file is generated
2. Validating the business logic on a standalone node and then validating it after running against multiple nodes
3. Validating the map reduce process to verify that key value pairs are generated correctly
4. Validating the aggregation and consolidation of data after the reduce process
5. Validating the output data against the source files and ensuring the data processing is completed correctly
6. Validating the output data file format and ensuring that the format is per the requirement.
Validation of Data Extract, and Load into EDW
Once the map-reduce process is completed and the data output files are generated, the processed data is moved to the enterprise data warehouse or any other transactional system, depending on the requirement.
Issues: Some issues that we face during this phase include incorrectly applied transformation rules, incorrect load of HDFS files into the EDW and incomplete data extracts from Hadoop HDFS.
Validations: Some high level scenarios that need to be validated during this phase include:
1. Validating that transformation rules are applied correctly
2. Validating that there is no data corruption by comparing target table data against HDFS file data
3. Validating the data load in the target system
4. Validating the aggregation of data
5. Validating the data integrity in the target system.
Validation of Reports
Analytical reports are generated using reporting tools by fetching the data from the EDW or running queries on Hive.
Issues: Some of the issues faced while generating reports are report definitions not set as per the requirement, report data issues, and layout and format issues.
Validations: Some high level validations performed during this phase include:
Reports Validation: Reports are tested after the ETL/transformation workflows are executed for all the source systems and the data is loaded into the DW tables. The metadata layer of the reporting tool provides an intuitive business view of the data available for report authoring. Checks are performed by writing queries to verify whether the views are getting the exact data needed for the generation of the reports.
Cube Testing: Cubes are tested to verify that dimension hierarchies with pre-aggregated values are calculated correctly and displayed in the report.
Dashboard Testing: Dashboard testing consists of testing the individual web parts and reports placed in a dashboard. Testing would involve ensuring all objects are rendered properly and the resources on the webpage are current and the latest. The data fetched from the various web parts is validated against the databases.
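As an illustration of how the data-level checks above can be scripted, the following plain-Python sketch recomputes a per-customer aggregate from a map-reduce output file and compares it with an extract of the target warehouse table. The file names, record layout and transformation rule are hypothetical; in practice the same comparison is often done with Hive/SQL queries or dedicated compare tools.

    import csv
    from collections import defaultdict

    def aggregate_hdfs_output(path):
        # Hypothetical layout: tab-separated lines of <customer_id> <amount> from the reduce step.
        totals = defaultdict(float)
        with open(path) as handle:
            for row in csv.reader(handle, delimiter="\t"):
                totals[row[0]] += float(row[1])
        return totals

    def load_edw_extract(path):
        # Hypothetical CSV extract of the target table with columns customer_id, total_amount.
        with open(path) as handle:
            return {row["customer_id"]: float(row["total_amount"]) for row in csv.DictReader(handle)}

    def compare(expected, actual, tolerance=0.01):
        mismatches = []
        for key, value in expected.items():
            if key not in actual:
                mismatches.append((key, "missing in target table"))
            elif abs(actual[key] - value) > tolerance:
                mismatches.append((key, "expected %.2f, got %.2f" % (value, actual[key])))
        return mismatches

    if __name__ == "__main__":
        issues = compare(aggregate_hdfs_output("part-r-00000.tsv"), load_edw_extract("edw_extract.csv"))
        print("PASS" if not issues else issues)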
VOLUME, VARIETY AND VELOCITY: HOW TO TEST?
In the earlier sections we have seen step-by-step details of what needs to be tested at each phase of Big data processing. During these phases, the three dimensions or characteristics of Big data, i.e., volume, variety and velocity, are validated to ensure there are no data quality defects and no performance issues.
Volume: The amount of data created both inside and outside corporations via the web, mobile devices, IT infrastructure, and other sources is increasing exponentially each year [3]. A huge volume of data flows from multiple systems and needs to be processed and analyzed. When it comes to validation, it is a big challenge to ensure that the whole data set processed is correct. Manually validating the whole data is a tedious task, so compare scripts should be used to validate the data. As the data stored in HDFS is in file format, scripts can be written to compare two files and extract the differences using compare tools [4]. Even with compare tools, a 100% data comparison takes a lot of time. To reduce the execution time, we can either run all the comparison scripts in parallel on multiple nodes, just as data is processed by the Hadoop map-reduce process, or sample the data while ensuring maximum scenarios are covered. Figure 3 shows the approach for comparing voluminous amounts of data: data is converted into the expected-result format and then compared with the actual data using compare tools. This is a faster approach but involves initial scripting time, and it reduces further regression testing cycle time. When we don't have time to validate the complete data, sampling can be done for validation.
Variety: The variety of data types is increasing, namely unstructured text-based data and semi-structured data like social media data, location-based data, and log-file data. Structured data is data in a defined format coming from different RDBMS tables or from structured files. Data that is of a transactional nature can be handled in files or tables for validation purposes.
Figure 3: Approach for High Volume Data Validation (map reduce jobs are run in the test environment to generate output data files; custom scripts convert unstructured data to structured data and raw data to the expected results format; a compare tool then performs a file-by-file comparison of expected and actual results and produces a discrepancy report). Source: Infosys Research
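The comparison step shown in Figure 3 can be scripted along the lines of the sketch below (plain Python, with hypothetical file names and layout): it diffs an expected-results file against the actual output file by key and, when a full comparison is too slow, falls back to validating a deterministic sample of records so that the same records are checked on every run.

    import hashlib

    def load_records(path, key_field=0, delimiter="\t"):
        # One record per line, keyed on one column (an illustrative layout).
        records = {}
        with open(path) as handle:
            for line in handle:
                parts = line.rstrip("\n").split(delimiter)
                records[parts[key_field]] = parts
        return records

    def in_sample(key, percent):
        # Deterministic sampling: hashing the key means the same records are picked on every run.
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % 100 < percent

    def compare_files(expected_path, actual_path, sample_percent=100):
        expected = load_records(expected_path)
        actual = load_records(actual_path)
        discrepancies = []
        for key, expected_row in expected.items():
            if not in_sample(key, sample_percent):
                continue
            actual_row = actual.get(key)
            if actual_row is None:
                discrepancies.append((key, "missing in actual output"))
            elif actual_row != expected_row:
                discrepancies.append((key, "field mismatch"))
        return discrepancies

    # Full comparison:  compare_files("expected.tsv", "actual.tsv")
    # 10% sampled run:  compare_files("expected.tsv", "actual.tsv", sample_percent=10)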
Semi-structured data does not have any defined format, but a structure can be derived from the multiple patterns in the data. An example is data extracted by crawling through different websites for analysis purposes. For validation, the data first needs to be transformed into a structured format using custom-built scripts. First the pattern needs to be identified, then copy books or pattern outlines need to be prepared; later these copy books are used in scripts to convert the incoming data into a structured format, and validations are then performed using compare tools.

Unstructured data is data that does not have any format and is stored in documents, web content, etc. Testing unstructured data is very complex and time consuming. Automation can be achieved to some extent by converting the unstructured data into structured data using scripting such as Pig scripting, as shown in Figure 3. But the overall coverage achieved through automation will be very low because of the unexpected behavior of the data; input data can be in any form and changes every time a new test is performed. We need to deploy a business scenario validation strategy for unstructured data. In this strategy we need to identify the different scenarios that can occur in our day-to-day unstructured data analysis, and test data needs to be set up based on these test scenarios and executed.

Velocity: The speed at which new data is being created, and the need for real-time analytics to derive business value from it, is increasing thanks to the digitization of transactions, mobile computing and the sheer number of internet and mobile device users. Data speed needs to be considered when implementing any Big data appliance to overcome performance problems. Performance testing plays an important role in identifying any performance bottleneck in the system and in ensuring that the system can handle high velocity streaming data.

NON-FUNCTIONAL TESTING
In the earlier sections we have seen how functional testing is performed at each phase of Big data processing; these tests are performed to identify functional coding issues and requirements issues. Performance testing and failover testing need to be performed to identify performance bottlenecks and to validate the non-functional requirements.

Performance Testing: Any Big data project involves processing huge volumes of structured and unstructured data across multiple nodes to complete the job in less time. At times, because of bad architecture and poorly designed code, performance is degraded. If the performance does not meet the SLA, the purpose of setting up Hadoop and other Big data technologies is lost. Hence performance testing plays a key role in any Big data project, due to the huge volume of data and the complex architecture. Some of the areas where performance issues can occur are imbalance in input splits, redundant shuffles and sorts, and performing in the reduce process aggregation computations that could have been done in the map process [5]. These performance issues can be eliminated by carefully designing the system architecture and doing performance tests to identify the bottlenecks. Performance testing is conducted by setting up a huge volume of data and an infrastructure similar to production. Utilities like the Hadoop performance monitoring tool can be used to capture the performance metrics and identify the issues. Performance metrics like job completion time and throughput, and system level metrics like memory utilization, etc., are captured as part of performance testing.

Failover Testing: The Hadoop architecture consists of a name node and hundreds of data nodes hosted on several server machines, each of them connected. There are chances of node failure, where some of the HDFS components become non-functional. Some of the failures can be name node failure, data node failure and network failure. The HDFS architecture is designed to detect these failures and automatically recover to proceed with the processing. Failover testing is an important focus area in Big data implementations, with the objective of validating the recovery process and ensuring that data processing happens seamlessly when switched to other data nodes. Some validations that need to be performed during failover testing are: validating that checkpoints of the edit logs and the FsImage of the name node happen at defined intervals; recovery of the edit logs and FsImage files of the name node; no data corruption because of name node failure; data recovery when a data node fails; and validating that replication is initiated when a data node fails or data becomes corrupted. Recovery Time Objective (RTO) and Recovery Point Objective (RPO) metrics are captured during failover testing.

TEST ENVIRONMENT SETUP
As Big data involves handling huge volumes and processing across multiple nodes, setting up a test environment is the biggest challenge. Setting up the environment on the cloud gives the flexibility to set it up and maintain it during test execution. Hosting the environment on the cloud will also help in optimizing the infrastructure and achieving faster time to market. Key steps involved in setting up the environment on the cloud are [6]:

A. Big data test infrastructure requirement assessment
1. Assess the Big data processing requirements
2. Evaluate the number of data nodes required in the QA environment
3. Understand the data privacy requirements to evaluate private or public cloud
4. Evaluate the software inventory required to be set up on the cloud environment (Hadoop, file system to be used, NoSQL DBs, etc.)

B. Big data test infrastructure design
1. Document the high level cloud test infrastructure design (disk space, RAM required for each node, etc.)
2. Identify the cloud infrastructure service provider
3. Document the SLAs, communication plan, maintenance plan and environment refresh plan
4. Document the data security plan
5. Document the high level test strategy, testing release cycles, testing types, volume of data processed by Hadoop, and third party tools required.
C. Big data test infrastructure implementation and maintenance
Create a cloud instance of the Big data test environment
Install Hadoop, HDFS, MapReduce and other software as per the infrastructure design
Perform a smoke test on the environment by processing a sample set of map reduce and Pig/Hive jobs
Deploy the code to perform testing.

BEST PRACTICES
Data Quality: It is very important to establish the data quality requirements for different forms of data, like traditional data sources, data from social media, data from sensors, etc. If the data quality is ascertained, the transformation logic alone can be tested by executing tests against all possible data sets.

Data Sampling: Data sampling gains significance in Big data implementations, and it becomes the tester's job to identify suitable sampling techniques that include all critical business scenarios and the right test data set.

Automation: Automate the test suites as much as possible. The Big data regression test suite will be used multiple times as the database is periodically updated. Hence an automated regression test suite should be built for use after each release. This will save a lot of time during Big data validations.

CONCLUSION
Data quality challenges can be countered by deploying a structured testing approach for both functional and non-functional requirements. Applying the right test strategies and following best practices will improve the testing quality, which will help in identifying defects early and reduce the overall cost of the implementation. Organizations need to invest in building skill sets in both development and testing. Big data testing will be a specialized stream, and the testing team should be built with a diverse skill set, including coding, white-box testing and data analysis skills, so that it can do a better job of identifying quality issues in the data.

REFERENCES
1. Big data overview, Wikipedia.org.
2. White, T. (2010), Hadoop: The Definitive Guide, 2nd Edition, O'Reilly Media.
3. Kelly, J. (2012), Big Data: Hadoop, Business Analytics and Beyond, A Big Data Manifesto from the Wikibon Community. Available at wikibon.org/wiki/v/Big_Data:_Hadoop,_Business_Analytics_and_Beyond, Mar.
4. Informatica Enterprise Data Integration (1998), Data verification using File and Table compare utility for HDFS and Hive tool. Available at informatica.com/solutions/
5. Bhandarkar, M. (2009), Practical Problem Solving with Hadoop, USENIX '09 Annual Technical Conference, June. Available at event/usenix09/training/tutonefile.html.
6. Naganathan, V. (2012), Increase Business Value with Cloud-based QA Environments. Available at infosys.com/it-services/independent-validation-testing-services/pages/cloud-based-qa-environments.aspx.
Infosys Labs Briefings VOL 11 NO 1 2013

Nature Inspired Visualization of Unstructured Big Data
By Aaditya Prakash

Reconstruct self-organizing maps as spider graphs for better visual interpretation of large unstructured datasets

The exponential growth of data capturing devices has led to an explosion of available data. Unfortunately, not all available data is in a database-friendly format. Data which cannot be easily categorized, classified or imported into a database is termed unstructured data. Unstructured data is ubiquitous and is assumed to be around 80% of all data generated [1]. While tremendous advancements have taken place in analyzing, mining and visualizing structured data, the field of unstructured data, especially unstructured Big data, is still in a nascent stage. The lack of recognizable structure and the huge size make it very challenging to work with large unstructured datasets. Classical visualization methods limit the amount of information presented and are asymptotically slow with rising dimensions of the data. We present here a model to mitigate these problems and allow efficient and broad visualization of large unstructured datasets.

A novel approach in unsupervised machine learning is Self-Organizing Maps (SOM). Along with classification, SOMs have the added benefit of dimensionality reduction. SOMs are also used for visualizing multidimensional data as a 2D planar diffusion map. This achieves data reduction, thus enabling visualization of large datasets. Present models used to visualize SOM maps lack deductive ability, which may be defeating the power of SOM. We introduce a better restructuring of SOM-trained data for more meaningful interpretation of very large data sets. Taking inspiration from nature, we model the large unstructured dataset into spider cobweb type graphs. This has the benefit of allowing multivariate analysis, as different variables can be presented in one spider graph and their inter-variable relations can be projected, which cannot be done with classical SOM maps.
UNSTRUCTURED DATA
Unstructured data comes in different formats and sizes. Broadly, textual data, sound, video, images, webpages, logs, emails, etc., are categorized as unstructured data. In some cases even a bundle of numeric data could collectively be unstructured, e.g., the health records of a patient: while a table of the cholesterol levels of all patients is more structured, all the biostats of a single patient are largely unstructured.

Unstructured data could be of any form and could contain any number of independent variables. Labeling, as is done in machine learning, is only possible with data where information about the variables, such as size, length, dependency, precision, etc., is known. Even the extraction of the underlying information from a cluster of unstructured data is very challenging, because it is not known what is to be extracted [2]. The potential of hidden analytics within large unstructured datasets could be a valuable asset to any business or research entity.

Consider the case of the Enron emails (collected and prepared by the CALO project). Emails are primarily unstructured, mostly because people often reply above the last email even when the new email's content and purpose might be different. Therefore most organizations do not analyze emails or email logs, but several researchers have analyzed the Enron emails and their results show that a lot of predictive and analytical information could be obtained from them [3, 4, 5].

SELF ORGANIZING MAPS
The ability to harness increased computing power has been a great boon to business. From traditional business analytics to machine learning, the knowledge we get from data is invaluable. With computing forecast to get faster, maybe quantum computing someday, data promises to play an even greater role. While there has been a lot of effort to bring some structure into unstructured data [6], the cost of doing so has been the hindrance. With larger datasets it is an even greater problem, as they entail more randomness and unpredictability in the data.

Self-Organizing Maps (SOM) are a class of artificial neural networks proposed by Teuvo Kohonen [7] that transform the input dataset into a two dimensional lattice, also called a Kohonen Map.

Structure
All the points of the input layer are mapped onto a two dimensional lattice, called the Kohonen Network. Each point in the Kohonen Network is potentially a neuron.

Figure 1: Kohonen Network. Source: Infosys Research

Competition of Neurons
Once the Kohonen Network is completed, the neurons of the network compete according to the weights assigned from the input layer. The function used to declare the winning neuron is the simple Euclidean distance between the input point and its corresponding weight for each of the neurons. This function, called the discriminant function, is represented as

d_j(x) = sum over i of (x_i - w_ji)^2

where,
x = point on the input layer
w = weight of the input point (x)
i = all the input points
j = all the neurons on the lattice
d = Euclidean distance

Simply put, the winning neuron is the one whose weight is closest (in distance) to the input point. This process effectively discretizes the output layer.

Cooperation of Neighboring Neurons
Once the winning neuron is found, the topological structure can be determined. Similar to the behavior of human brain cells (neurons), the winning neuron also excites its neighbors. Thus the topological structure is determined by the cooperative weights of the winning neuron and its neighbors.

Self-Organization
The process of selecting winning neurons and forming the topological structure is adaptive. The process runs multiple times to converge on the best mapping of the given input layer. SOM is better than other clustering algorithms in that it requires very few repetitions to reach a stable structure.

Parallel SOM for large datasets
Among all classifying machine learning algorithms, the convergence speed of SOM has been found to be the fastest [8]. This implies that for large datasets SOM is the most viable model. Since the formation of the topological structure is independent of the input points, it can easily be parallelized. Carpenter et al. have demonstrated the ability of SOM to work under massively parallel processing [9]. Kohonen himself has shown that even where the input data may not be in vector form, as found in some unstructured data, large scale SOM can be run nonetheless [10].

SOM PLOTS
SOM plots are a two dimensional representation of the topological structure obtained after training the neural nets for a given number of repetitions and with a given radius. The SOM can be visualized as a complete 2-D topological structure [Fig. 2].

Figure 2: SOM Visualization using RapidMiner (AGPL Open Source). Source: Infosys Research

Figure 2 shows the overall topological structure obtained after dimensionality reduction of a multivariate dataset. While this graph may be useful for outlier detection or general categorization, it is not very useful for the analysis of individual variables. The other option for visualizing SOM output is to plot different variables in a grid format. One can use the R programming language (GNU Open Source) to plot the SOM results.
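The paper names the R packages it uses but does not reproduce any code. As a hedged illustration, the grid-style plots discussed in the next section can be produced with the kohonen package roughly as follows; the local file name for the UCI SPAM database, the 10x10 hexagonal grid and the 100 training repetitions are assumptions made for this sketch, not settings taken from the paper.

# Train a SOM on the UCI SPAM database and draw the standard kohonen plots.
# "spambase.data" is assumed to have been downloaded locally from the
# UCI Machine Learning Repository; the file has no header row.
library(kohonen)

spam  <- read.csv("spambase.data", header = FALSE)
words <- scale(spam[, 1:48])            # the 48 word-frequency attributes

set.seed(7)
som_model <- som(as.matrix(words),
                 grid = somgrid(xdim = 10, ydim = 10, topo = "hexagonal"),
                 rlen = 100)            # number of training repetitions

plot(som_model, type = "counts")        # topological view of how inputs map to units
plot(som_model, type = "codes")         # per-unit codebook vectors across variables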
Note on running example
All the plots presented henceforth have been obtained using the R programming language. The dataset used is the SPAM database, which is in the public domain and freely available for research at the UCI Machine Learning Repository. It contains the word instances of 4601 SPAM emails. Emails are a good example of unstructured data. Using the public packages in R, we obtain the SOM plots.

Figure 3: SOM Visualization in R using the Package Kohonen. Source: Infosys Research

Figure 3 is the plot of the SOM-trained result using the package Kohonen [11]. This plot gives inter-variable analysis, the variables in this case being four of the most used words in the SPAM database, viz. order, credit, free and money. While this plot is better than the topological plot given in Figure 2, it is still difficult to interpret the result in a canonical sense.

Figure 4: SOM Visualization in R using the Package SOM. Source: Infosys Research

Figure 4 is again the SOM plot of the above four most common words in the SPAM database, but this one uses the package called SOM [12]. While this plot is numerical and gives the strength of the inter-variate relationship, it does not help in giving us the analytical picture. The information obtained is not actionable.

SPIDER PLOTS OF SOM
As we have seen in Figures 2, 3 and 4, the current visualization of SOM output could be improved for more analytical ability. We introduce a new method to plot SOM output, especially designed for large datasets.

Algorithm
1. Filter the results of SOM.
2. Make a polygon with as many sides as the variables in the input.
3. Make the radius of the polygon the maximum of the values in the dataset.
4. Draw the grid for the polygon.
5. Make segments inside the polygon if the strength of the two variables inside the segment is greater than the specified threshold.
6. Loop step 5 for every variable against every other variable.
7. Color the segments based on the frequency of the variable.
8. Color the line segments based on the threshold of each variable pair plotted.
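The author does not publish an implementation of this algorithm, so the following base R sketch is only one possible reading of steps 1 to 8. The inter-variable strength matrix is synthetic and stands in for the filtered SOM results of step 1, and the threshold, colors and line weights are illustrative assumptions.

# Sketch of the spider (cobweb) plot: one polygon vertex per variable, threads
# drawn between variable pairs whose strength exceeds a threshold (steps 2 to 8).
spider_plot <- function(strength, threshold = 0.4) {
  vars   <- colnames(strength)
  n      <- length(vars)
  r      <- max(strength)                         # step 3: radius = maximum value
  angles <- 2 * pi * (seq_len(n) - 1) / n
  vx <- r * cos(angles)
  vy <- r * sin(angles)                           # step 2: polygon vertices

  plot(NULL, xlim = 1.2 * c(-r, r), ylim = 1.2 * c(-r, r),
       axes = FALSE, xlab = "", ylab = "", asp = 1)
  polygon(vx, vy, border = "grey40")              # polygon outline
  for (f in c(0.25, 0.5, 0.75)) {
    polygon(f * vx, f * vy, border = "grey85")    # step 4: inner grid rings
  }
  text(1.1 * vx, 1.1 * vy, labels = vars)

  for (i in 1:(n - 1)) {                          # steps 5 and 6: every variable pair
    for (j in (i + 1):n) {
      s <- strength[i, j]
      if (s > threshold) {                        # step 5: only pairs above the threshold
        segments(vx[i], vy[i], vx[j], vy[j],
                 lwd = 1 + 4 * s / r,             # thread weight reflects strength
                 col = ifelse(s > 2 * threshold,  # steps 7 and 8: simple two-level coloring
                              "firebrick", "steelblue"))
      }
    }
  }
}

# Synthetic symmetric strength matrix for the four running-example words.
set.seed(1)
m <- matrix(runif(16), 4, 4)
m <- (m + t(m)) / 2
diag(m) <- 0
colnames(m) <- rownames(m) <- c("order", "credit", "free", "money")
spider_plot(m, threshold = 0.4)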
Figure 5: SOM Visualization in R Using the Above Algorithm, Showing Segments, i.e., Inter-variable Dependency. Source: Infosys Research

Plots
As we can see, this plot is more meaningful than the SOM visualization plots obtained before. From the figure we can easily deduce that the words free and order do not have a relation similar to that of credit and money. Understandably so, because if a Spam email is selling something, it will probably contain the word order; conversely, if it is advertising a product or software for free download, then it wouldn't have the word order in it. The high relationship between credit and money signifies Spam emails advertising better Credit Score programs and other marketing traps.

Figure 6: SOM Visualization in R Using the Above Algorithm, Showing Threads, i.e., Inter-variable Strength. Source: Infosys Research

Figure 6 shows the relationship of each variable, in this case four popular recurring words in the Spam database. The number of threads between one variable and another shows the probability of the second variable given the first variable. Several threads between free and credit suggest that Spam emails offering free credit (disguised in other forms by fees or deferred interests) are among the most popular.

Figure 7: Spider Plot Showing 25 Sampled Words from the Spam Database. Source: Infosys Research

Using these spider plots we can analyze several variables at once. This may cause the graph to become messy, but sometimes we need to see the complete picture in order to make canonical decisions about the dataset. From Figure 7 we can see that even though the figure shows 25 variables, it is not as cluttered as a scatter plot or bar chart would be if plotted with 25 variables.

Figure 8: Uncolored Representation of Threads in Six Variables. Source: Infosys Research
Figure 8 shows the different levels of strength between different variables. While the variable contact is strong with need but not strongly enough with help, it is no surprise that you and need are strong. Here the idea was only to present the visualization technique and not an analysis of the Spam dataset. For more on Spam filtering and Spam analysis one may refer to several independent works on the same [13, 14].

ADVANTAGES
There are several visual and non-visual advantages of using this new plot over the existing plots shown earlier. This plot has been designed to handle Big data. Most of the existing plots mentioned above are limited in their capacity to scale. Principally, if the range of the data is large then most of the existing plots tend to get skewed and important information is lost. By normalizing the data, this new plot prevents that issue. Allowing multiple dimensions to be incorporated also allows for the recognition of indirect relationships.

CONCLUSION
While unstructured data is abundant, free and hidden with information, the tools for analyzing it are still nascent and the cost of converting it to structured form is very high. Machine learning is used to classify unstructured data but comes with issues of speed and space constraints. SOMs are the fastest machine learning algorithms, but their visualization powers are limited. We have presented a naturally intuitive method to visualize SOM outputs which facilitates multi-variable analysis and is also highly scalable.

REFERENCES
1. Grimes, S., Unstructured data and the 80 percent rule. Retrieved from aspx?tabid=
2. Doan, A., Naughton, J. F., Ramakrishnan, R., Baid, A., Chai, X., Chen, F. and Vuong, B. Q. (2009), Information extraction challenges in managing unstructured data, ACM SIGMOD Record, vol. 37, no. 4, pp.
3. Diesner, J., Frantz, T. L. and Carley, K. M. (2005), Communication networks from the Enron email corpus "It's always about the people. Enron is no different." Computational & Mathematical Organization Theory, vol. 11, no. 3, pp.
4. Chapanond, A., Krishnamoorthy, M. S., and Yener, B. (2005), Graph theoretic and spectral analysis of Enron email data. Computational & Mathematical Organization Theory, vol. 11, no. 3, pp.
5. Peterson, K., Hohensee, M., and Xia, F. (2011), Email formality in the workplace: A case study on the Enron corpus. In Proceedings of the Workshop on Languages in Social Media, pp. Association for Computational Linguistics.
6. Buneman, P., Davidson, S., Fernandez, M., and Suciu, D. (1997), Adding structure to unstructured data. Database Theory ICDT '97, pp.
7. Kohonen, T. (1990), The self-organizing map. Proceedings of the IEEE, vol. 78, no. 9, pp.
8. Waller, N. G., Kaiser, H. A., Illian, J. B., and Manry, M. (1998), A comparison of the classification capabilities of the 1-dimensional Kohonen neural network with two partitioning and three hierarchical cluster analysis algorithms. Psychometrika, vol. 63, no. 1, pp.
9. Carpenter, G. A., and Grossberg, S. (1987), A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics, and Image Processing, vol. 37, no. 1, pp.
10. Kohonen, T., and Somervuo, P. (2002), How to make large self-organizing maps for non-vectorial data. Neural Networks, vol. 15, no. 8, pp.
11. Wehrens, R. and Buydens, L. M. C. (2007), Self- and Super-organizing Maps in R: The Kohonen Package. Journal of Statistical Software, vol. 21, no. 5, pp.
12. Yan, J. (2012), Self-Organizing Map (with application in gene clustering) in R. Available at web/packages/som/som.pdf.
13. Dasgupta, A., Gurevich, M., and Punera, K. (2011), Enhanced spam filtering through combining similarity graphs. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pp.
14. Cormack, G. V. (2007), Email spam filtering: A systematic review. Foundations and Trends in Information Retrieval, vol. 1, no. 4, pp.
Index
Automated Content Discovery 48, 49
Big Data
  Analytics 4-8, 19, 24, 40-43, 45, 67
  Lifecycle 21
  Medical Engine
  Value, also BDV 27, 29
Campaign Management 31, 32
Common Warehouse Meta-Model, also CWM 7
Communication Service Providers, also CSPs 27
Complex Event Processing, also CEP
Content
  Processing Workflows 50
  Publishing Lifecycle Management, also CPLM 48
  Management System, also CMS 30, 48, 51
Contingency Funding Planning, also CFP 36
Customer
  Dynamics 19-21, 25
  Relationship 28, 30
Data Warehouse 4-5, 30, 38-39, 66, 68
Enterprise Service Bus, also ESB 30
Event Driven Process Automation Architecture, also EDA
Experience Personalization 31
Extreme Content Hub, also ECH
Global Positioning Service, also GPS 10, 13, 17, 54, 56
Management
  Business Process, also BPM 30
  Custom Relationship, also CRM
  Information 3
  Liquidity Risk, also LRM
  Master Data 5-6
  Offer 32
  Order 30
  Retention 31, 32
Metadata
  Discovery 6-7
  Extractor 50
  Governance 6-7
  Management 3-8
Net Interest Income Analysis, also NIIA 37
Predictive
  Intelligence 19
  Modeling 32
  Analytics 54
Service Management 31, 33
Supply Chain Planning 9-12, 53
Un-Structured Content Extractor 50
Web Analytics 21
84 Infosys Labs Briefings BUSINESS INNOVATION through TECHNOLOGY Editor Praveen B Malla PhD Deputy Editor Yogesh Dandawate Graphics & Web Editor Rakesh Subramanian Chethana M G Vivek Karkera IP Manager K V R S Sarma Marketing Manager Gayatri Hazarika Online Marketing Sanjay Sahay Production Manager Sudarshan Kumar V S Database Manager Ramesh Ramachandran Distribution Managers Santhosh Shenoy Suresh Kumar V H How to Reach Us: [email protected] Editorial Office: Infosys Labs Briefings, B-19, Infosys Ltd. Electronics City, Hosur Road, Bangalore , India [email protected] Infosys Labs Briefings is a journal published by Infosys Labs with the objective of offering fresh perspectives on boardroom business technology. The publication aims at becoming the most sought after source for thought leading, strategic and experiential insights on business technology management. Infosys Labs is an important part of Infosys commitment to leadership in innovation using technology. Infosys Labs anticipates and assesses the evolution of technology and its impact on businesses and enables Infosys to constantly synthesize what it learns and catalyze technology enabled business transformation and thus assume leadership in providing best of breed solutions to clients across the globe. This is achieved through research supported by state-of-the-art labs and collaboration with industry leaders. About Infosys Many of the world s most successful organizations rely on Infosys to deliver measurable business value. Infosys provides business consulting technology, engineering and outsourcing services to help clients in over 32 countries build tomorrow s enterprise. For more information about Infosys (NASDAQ:INFY), visit Phone: Post: Infosys Labs Briefings, B-19, Infosys Ltd. Electronics City, Hosur Road, Bangalore , India Subscription: [email protected] Rights, Permission, Licensing and Reprints: [email protected] Infosys Limited, 2013 Infosys acknowledges the proprietary rights of the trademarks and product names of the other companies mentioned in this issue. The information provided in this document is intended for the sole use of the recipient and for educational purposes only. Infosys makes no express or implied warranties relating to the information contained herein or to any derived results obtained by the recipient from the use of the information in this document. Infosys further does not guarantee the sequence, timeliness, accuracy or completeness of the information and will not be liable in any way to the recipient for any delays, inaccuracies, errors in, or omissions of, any of the information or in the transmission thereof, or for any damages arising therefrom. Opinions and forecasts constitute our judgment at the time of release and are subject to change without notice. This document does not contain information provided to us in confidence by our clients.
86 Authors featured in this issue AADITYA PRAKASH is a Senior Systems Engineer with the FNSP unit of Infosys. He can be reached at [email protected]. ABHISHEK KUMAR SINHA is a Senior Associate Consultant with the FSI business unit of Infosys. He can be reached at [email protected]. AJAY SADHU is a Software Engineer with the Big data practice under the Cloud Unit of Infosys. He can be contacted at [email protected]. ANIL RADHAKRISHNAN is a Senior Associate Consultant with the FSI business unit of Infosys. He can be reached at [email protected]. BILL PEER is a Principal Technology Architect with the Infosys Labs. He can be reached at [email protected]. GAUTHAM VEMUGANTI is a Senior Technology Architect with the Corp PPS unit of Infosys. He can be contacted at [email protected]. KIRAN KALMADI is a Lead Consultant with the FSI business unit of Infosys. He can be contacted at [email protected]. MAHESH GUDIPATI is a Project Manager with the FSI business unit of Infosys. He can be reached at [email protected]. NAJU D MOHAN is a Delivery Manager with the RCL business unit of Infosys. She can be contacted at [email protected]. NARAYANAN CHATHANUR is a Senior Technology Architect with the Consulting and Systems Integration wing of the FSI business unit of Infosys. He can be reached at [email protected]. NAVEEN KUMAR GAJJA is a Technical Architect with the FSI business unit of Infosys. He can be contacted at [email protected]. PERUMAL BABU is a Senior Technology Architect with RCL business unit of Infosys. He can be reached at [email protected]. PRAKASH RAJBHOJ is a Principal Technology Architect with the Consulting and Systems Integration wing of the Retail, CPG, Logistics and Life Sciences business unit of Infosys. He can be contacted at [email protected]. PRASANNA RAJARAMAN is a Senior Project Manager with RCL business unit of Infosys. He can be reached at [email protected]. SARAVANAN BALARAJ is a Senior Associate Consultant with Infosys Retail & Logistics Consulting Group. He can be contacted at [email protected]. SHANTHI RAO is a Group Project Manager with the FSI business unit of Infosys. She can be contacted at [email protected]. SUDHEESHCHANDRAN NARAYANAN is a Senior Technology Architect with the Big data practice under the Cloud Unit of Infosys. He can be reached at [email protected]. ZHONG LI PhD. is a Principal Architect with the Consulting and System Integration Unit of Infosys. He can be contacted at [email protected].
87 Subu Goparaju Senior Vice President and Head of Infosys Labs At Infosys Labs, we constantly look for opportunities to leverage technology while creating and implementing innovative business solutions for our clients. As part of this quest, we develop engineering methodologies that help Infosys implement these solutions right, first time and every time. For information on obtaining additional copies, reprinting or translating articles, and all other correspondence, please contact: Infosys Limited, 2013 Infosys acknowledges the proprietary rights of the trademarks and product names of the other companies mentioned in this issue of Infosys Labs Briefings. The information provided in this document is intended for the sole use of the recipient and for educational purposes only. Infosys makes no express or implied warranties relating to the information contained in this document or to any derived results obtained by the recipient from the use of the information in the document. Infosys further does not guarantee the sequence, timeliness, accuracy or completeness of the information and will not be liable in any way to the recipient for any delays, inaccuracies, errors in, or omissions of, any of the information or in the transmission thereof, or for any damages arising there from. Opinions and forecasts constitute our judgment at the time of release and are subject to change without notice. This document does not contain information provided to us in confidence by our clients.