D2.3.1 First Draft of Sector's Requisites

Project Acronym: BIG
Project Title: Big Data Public Private Forum (BIG)
Project Number:
Instrument: CSA
Thematic Priority: ICT

D2.3.1 First Draft of Sector's Requisites

Work Package: WP2 Strategy & Operations
Due Date: 30/04/2013
Submission Date: 22/05/2013
Start Date of Project: 01/10/2012
Duration of Project: 26 Months
Organisation Responsible of Deliverable: Siemens
Version: 1.2
Status: Final

Author name(s):
  Sonja Zillner (Chapter Health), Siemens AG
  Sebnem Rusitschka (Chapters Energy, Transport), Siemens AG
  Ricard Munné (Chapter Public), ATOS
  Helen Lippell (Chapter Telco & Media), PA
  Felicia Lobillo Vilela (Chapter Telco & Media), ATOS
  Kazim Hussain (Chapter Finance & Insurance), ATOS
  Tilman Becker (Chapter Manufacturing), DFKI
  Ralf Jung (Chapter Retail), DFKI
  Denise Paradowsk (Chapter Retail), DFKI
  Yi Huang (Edition), Siemens AG

Reviewer(s):
  Sabrina Neururer (Chapter Health), UIBK
  John Domingue (Chapter Public), STI
  Michael Hausenblas (Chapter Telco & Media), NUIG/DERI
  Pedro Soria Rodriguez (Chapter Finance & Insurance), ATOS
  Amar Djalil Mezaour (Chapters Energy, Manufacturing, Retail, Transport), EXALEAD
  Tilman Becker (Final Review), DFKI

Nature:
  R - Report
  P - Prototype
  D - Demonstrator
  O - Other

Dissemination level:
  PU - Public
  CO - Confidential, only for members of the consortium (including the Commission)
  RE - Restricted to a group specified by the consortium (including the Commission Services)

Project co-funded by the European Commission within the Seventh Framework Programme ( )

BIG consortium Page 1 of 162

Revision history

Version | Date | Modified by | Comments
 | /03/2013 | Sonja Zillner, Sebnem Rusitschka (Siemens AG) | TOC provided
 | /05/2013 | All authors | Master version: aggregate the single chapters
 | /05/2013 | Tilman Becker (DFKI) | Quality control
 | /05/2013 | Tilman Becker (DFKI) | Integrated final sector review (retail, manufacturing)

Copyright 2012, BIG Consortium

The BIG Consortium grants third parties the right to use and distribute all or parts of this document, provided that the BIG project and the document are properly referenced.

THIS DOCUMENT IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS DOCUMENT, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Executive Summary

The overall objective of the sector forums in the BIG project is to acquire a deep understanding of how big data technology can be used in the various industrial sectors, such as healthcare, public, finance and insurance, telecom, media and entertainment, manufacturing, retail, energy and transport. In this document, we describe for each sector the particular needs and requirements of the industry and discuss the possible market impact and the related drivers and constraints. We identify the stakeholders of each industry and highlight their roles and interests, discuss available data sources for big data applications, and finally describe some concrete big data application domains. This first version of the sector requisites will evolve into a more focused document during the next twelve months through continued interviews with stakeholders and decision makers in the respective industries. The current key findings of the various sectors can be preliminarily summarized as follows:

Several developments in the healthcare sector, such as escalating healthcare costs, an increased need for healthcare coverage and shifts in provider reimbursement trends, trigger the demand for big data technology. In addition, the availability of and access to health data is continuously improving, the required big data technologies, such as advanced data integration and analytics technologies, are in place, and first-mover best-practice applications demonstrate the potential of big data technology in healthcare. In a nutshell, the big data revolution in the healthcare domain is at a very early stage, with most of the potential for value creation and business development unclaimed as well as unexplored. Current roadblocks are the established incentives of the healthcare system, which hinder collaboration and, thus, data sharing and exchange.
The trend towards value-based healthcare delivery will foster the collaboration of stakeholders to enhance the value of the patient's treatment, and thus will significantly increase the need for big data applications.

The public sector is facing some important challenges today: a lack of productivity compared to other activities, current budgetary constraints, and structural problems due to the aging population, which will lead to an increasing demand for medical and social services and a foreseen lack of a young workforce in the future. The public sector is increasingly aware of the potential value to be gained from Big Data, as it may provide vast improvements in effectiveness and efficiency, as well as new tools with automated analytics for processing large amounts of data. Governments generate and collect vast quantities of data through their everyday activities, such as managing pensions and allowance payments, tax collection, National Health System patient care, recording traffic data and issuing official documents. This data is produced in many formats; textual and numerical data are predominant, but other multimedia formats are also used for specific duties entrusted to the sector. However, the path to success is not clear due to some uncertainties, to name a few: Big Data technology is immature; there is a lack of skilled people; and there are uncertainties about new EC directives for data protection and Public Sector Information.

The finance and insurance sector has been an intensively data-driven industry for many years, with areas such as capital market trading having relied on various forms of data analytics for some time. However, in the midst of the changing landscape of the industry, business leaders are recognising the commercial and competitive advantage to be gained by leveraging new advancements in Big Data technologies.
Some opportunities for Big Data within finance and insurance have also been born out of necessity, largely thanks to the introduction of stringent regulatory requirements and the digitisation of financial products and services. Big Data in the industry can be used for very targeted marketing campaigns based on the analysis of customer transactional data, for early fraud detection and for predictive credit risk modelling, to name a few. Some challenges lie in being able to first identify and then utilise data of value currently held across organisational silos and managed in tools which make use of relatively primitive database technology. There is also a lack of skills in the market that can effectively bridge the gap between the Data Scientist and the Business Manager. It is clear that the industry is embracing Big Data technologies, though there is much scope for progression.

The telecom sector seems to be convinced of the potential of Big Data technologies. The combination of benefits within different telecom domains (marketing and offer management, customer relationship, service deployment and operations, etc.) can be summarised as the achievement of operational excellence for telco players. However, there are still challenges that need to be addressed before Big Data is generally adopted. Big Data can only work if a business puts a well-defined data strategy in place before it starts collecting and processing information. Obviously, investment in technology requires a strategy to use it according to commercial expectations; otherwise, it is better to keep current systems and procedures. More generally, the required strategy might imply deep changes in business processes that must also be carried out. The challenge is that operators have simply not taken the time to decide where this strategy should take them, probably due to the current economic situation, which leads to shorter-term decisions. This might be the origin of the Data as a Service trend that some operators are following, which consists of providing companies and public sector organisations with analytical insights that enable these third parties to become more effective. This involves the development of a range of products and services using different data sets, including machine-to-machine data and anonymised and aggregated mobile network customer data.

The media and entertainment industries have changed at an unprecedented rate over the last few years since the advent of the World Wide Web and the proliferation of advanced mobile technologies.
Traditional mass media organisations are no longer the primary gatekeepers of news and culture. They have to compete to maintain influence and authority in a complex landscape of social media, multiple channels and devices, and reduced barriers to entry for anyone wishing to broadcast, blog and communicate. In effect, traditional media organisations need to move to B2C-type behaviours to better understand and serve their customers. Companies may also need to reshape themselves in order to participate in all aspects of the media landscape. Media remains a hugely important component of the economy: in 2011, the EU newspaper and news media sector alone generated revenue of 36 billion euros. The Big Data revolution has been called the "quantified economy" and will have an impact across every economic sector. This creates both opportunities and challenges for companies of every size. Data processing at a scale that was unimaginable only a few years ago is now within the reach of more than just the biggest players. However, it cannot be assumed that technology vendors, media firms, customers and regulators are ready to embrace the change. The 3 V's of Big Data, Volume, Velocity and Variety, are a well-known axiom for the Big Data ecosystem. To these may be added Variability, Veracity, Visibility and Value. Publishers and media organisations have always been heavy consumers and producers of data, but digital technologies have increased both the complexity and the potential opportunities for exploitation. Within companies, Customer Relationship Management systems, analytics applications and social media engagement tools can capture significant amounts of data about service users and their behaviour. But the ease of extracting actionable insight from data remains an issue.

The energy sector is potentially well suited to be mined with big data technologies. The industry already has many metering and measuring points to run its interconnected power network spanning all of Europe.
Innumerable supply sources, facilities and plants, transportation nodes and other consumption points are now increasingly being measured in near real time. There is a possible new market built around this energy data. However, uncertainties such as privacy and confidentiality issues seem to hold the market in paralysis in many European countries. Compared to the US, there is not even a noticeable number of start-ups built around this data, although many European pilots and studies show that there is market potential. Another interesting point is that in the really big data applications, such as smart metering, the business value is literally scattered over all the stakeholders, and none really seems to be the one responsible for taking on the regulatory uncertainties. On the other hand, there are a number of markets within the energy sector, such as oil and gas, which are inherently big data. It has become a habit to collect more data the more sensing technologies, i.e. so-called data generators, are available. Only recently have these markets realized that the wells of data can be operationalized.

The transport sector has many facets and can be analyzed along the axes of mode (road, rail, air, etc.), the elements comprising the sector (infrastructure, vehicles and operation), and the function of transportation, which is either freight transport or passenger transport. There are also further segmentation possibilities into medium distances, long distances and urban transport. Depending on the size of the area and the actors involved, it is conceivable that urban multimodal transportation is one of the more complex big data settings in the transportation sector, especially because it involves end users as well as multiple businesses exchanging data across their organizational boundaries. Especially with the wide spread of smartphones and location-based value-added services already available to end users, the "perfect trip" seems increasingly feasible with big data technologies.
Urban multimodal transportation needs to satisfy constraints such as being comfortable to use and offering the most economical or green way to travel according to personal preferences and location-based options, including public transportation, rail, car-sharing or cycling, for example. The better connected all of the elements are, the more feasible multimodal transport becomes. As with all the other industrial sectors, the interconnectedness is achieved by sensors and devices enabling data exchange, especially of location data. Within the urban setting, open data can be an additional and important driver of big data ecosystems for the transport sector.

In the manufacturing sector, Big Data is part of a fourth industrial revolution, named Industry 4.0, that sees a transformation of the manufacturing processes to integrated cyber-physical systems. Integration becomes a requirement in the vertical as well as the horizontal direction. For Europe, the challenge will be to stay or become market leader and at the same time establish Europe as a leading market for Big Data technology in the manufacturing sector as an integrated part of Industry 4.0. Specifically in the manufacturing sector, adapted interfaces to Big Data and predictive data analysis methods will be needed.

The retail sector is faced with a growing amount of data and the availability of heterogeneous data sources. New marketing possibilities and business models, such as electronic commerce and mobile commerce, show the potential for changing requirements and expectations of the new generation of consumers. Traditional retailers with their physical hypermarkets, in particular, have to rethink their business models. New technology trends such as the Internet of Things & Services can be used to attract new customers by creating new marketing concepts and business strategies. Classic metrics such as inventory information (e.g. Stock Keeping Units) or Point of Sale analysis (e.g. which products have been sold and when) still play an important role, but knowledge about the customers is becoming more and more important. The potential of tailored and personalized customer communication, so-called Precision Retailing, is one of the hot topics for marketing experts. The identification and understanding of customer needs and behaviour require the collection, processing and analysis of large amounts of data from different sources.

Our preliminary analysis of the key findings across the sectors indicates that it is important to distinguish the technical from the business perspective. From a technological perspective, big data applications represent an evolutionary step. Big data technologies, such as decentralized networking and distributed computing for scalable data storage and scalable data analytics, semantic technologies and ontologies, machine learning, natural language processing and other data mining techniques, have been the focus of research projects for many years. Now these techniques are being combined and extended to address the technical challenges of the so-called big data paradigm. However, when analysing the business perspective, it becomes clear that big data applications have a revolutionary, sometimes even disruptive, impact on existing business-as-usual practices in industry. If thought through, new players emerge that are better suited to offer services based on mass data, and underlying business processes change fundamentally. For instance, in the healthcare domain, big data technologies can be used to produce new insights about the effectiveness of treatments, and this knowledge can be used to increase the quality of care. However, in order to foster such big data applications, the industry requires new reimbursement models that reward the quality instead of the quantity of treatments. Similar changes are required in the energy industry: energy usage data from end users would have to be shared beyond the organizational boundaries of all stakeholders, such as energy retailers, distribution network operators, and new players such as demand response providers and aggregators and energy efficiency service providers. But who is to invest in the technologies that would harvest the energy data in the first place? New participatory business value networks are required instead of static value chains.

Within all industries, the 3 V's of Big Data, Volume, Velocity and Variety, have been of relevance. In addition, industrial sectors that are already reviewing themselves in the light of the big data era add further V's to reflect their specific aspects and to adapt the big data paradigm to their particular needs. Many of those extensions, such as data privacy, data quality, data confidentiality, etc., address the challenge of data governance; some of the added extensions, such as business value, even address the fact that the potential business value of big data applications is as yet unexplored and not well understood.

In addition, within all industrial sectors, it became clear that it is not the availability of technology but the lack of business cases and business models that is hindering the implementation of big data applications. Usually, a business case needs to be clearly defined and convincing before new applications are initiated. However, in the context of big data applications, the development of a concrete business case is, as indicated, a very challenging task. This is due to two reasons. First, as the impact of big data applications relies on the aggregation of not only one but a large variety of heterogeneous data sources beyond organisational boundaries, the effective cooperation of multiple stakeholders with potentially diverging or at first orthogonal interests is required. Thus, the stakeholders' individual interests and constraints, which in addition are quite often moving targets, need to be reflected within the business case. Secondly, existing approaches for developing business models and business cases usually focus on single organisations and do not provide guidance for dynamic networks of multiple stakeholders. Although many of the established market players are currently struggling to develop concrete business cases, one can observe that at the same time more and more start-ups offering big data applications are entering the market. One could now assume that start-ups are more talented in developing business cases. However, this would be a simplistic assumption, as the majority of start-ups offer niche solutions that rely on a small number of data sources, which in turn makes the stakeholder governance and thus the business case development much easier.
Finally, it is important to mention that in nearly every industry we could identify some blueprints of successfully implemented business ecosystems that are generating and exploiting value by analyzing (manually and automatically) large sets of heterogeneous and complex data. Those best-practice scenarios managed to get rid of both the data silos and the organisational silos. In order to identify suitable governance structures as well as recommended interventions that foster the development of big data application ecosystems, we plan to explore the identified best-practice solutions in further detail.

Table of Contents

Executive Summary

1. Health
  Introduction
  Definition of Big Data in Healthcare Sector
  Benefits, Advantages & Impact
  Industrial Background
  Characteristics of the European Healthcare Industry
  Market Impact and Competition
  Stakeholders: Roles and Interest
  Available Data Sources
  Drivers and Constraints
  Challenges that need to be addressed
  Role of Regulation and Legislation
  Big Data Application Scenarios
  Comparative Effectiveness Research
  Clinical Decision Support
  Clinical Operation Intelligence
  Secondary Usage of Health Data
  Public Health Analytics
  Patient Engagement Application
  Conclusion and Next Steps
  Abbreviations and acronyms
  References

2. Public Sector
  Introduction
  Industrial Background
  Characteristics of the European Public Sector
  Stakeholders: Roles and Interest
  Available Data Sources
  Nature of the data
  Big Data in the public sector
  Big Data benefits in the public sector
  Driver and Constraints
  Challenges that need to be addressed
  Role of regulation and legislation
  Big Data Application Scenarios
  Monitoring and supervision of on-line gambling operators
  Operative efficiency in Labour Agency
  Conclusion and Next Steps
  Abbreviations and acronyms
  References

3. Finance and Insurance
  3.1. Introduction
  Definition of Big data in Finance
  Benefits, Advantages and Impact
  Stakeholders: Roles and Interest
  Drivers of Big Data in Financial Services and Insurance
  Challenges that need to be addressed
  Role of Regulation and Legislation
  Big Data Application Scenarios
  Conclusion and Next Steps

4. Telco, Media & Entertainment
  Introduction
  Industrial Background
    Telecom
    Media and Entertainment
  Big Data Application Scenarios
    Telecom
    Media and Entertainment
  Requirements
    Telecom
    Media and Entertainment
  Conclusion
    Telecom
    Media and Entertainment
  Abbreviations and acronyms
  References
    Telecom sector
    Media and entertainment sectors

5. Retail, Transport, Manufacturing
  Introduction
  Industrial Background
  Characteristics of the European Retail, Transport, and Manufacturing Sectors
  Market Impact and Competition
  Stakeholders: Roles and Interest
  Big Data Application Scenarios
  Big Data Requirements
  Data Acquisition
  Data Analysis
  Data Storage
  Data Curation
  Data Usage
  Conclusion and Next Steps
  Abbreviations and acronyms
  References

6. Energy
  Introduction
  Definition of Big Data in the Energy Sector
  Benefits, Advantages & Impact
  Industrial Background
  Characteristics of the European Energy Sector
  Market Impact and Competition
  Stakeholders: Roles and Interest
  Available Data Sources
  Drivers and Constraints
  Challenges that need to be addressed
  Role of regulation and legislation
  Big Data Application Scenarios
  Intelligent On Demand Reconfiguration of Power Networks
  Flexible Tariffs for Demand Side Management
  Big Data Requirements
  Data Acquisition
  Data Analysis
  Data Storage
  Data Curation
  Data Usage
  Conclusion and Next Steps
  References

Annex a. Big Data questionnaire for public sector
Annex b. Big Data questionnaire
Annex c. Multiples of bytes

Index of Figures

Figure 1. Public Sector Information Stakeholders in the PSI system (Correia, 2004)
Figure 2. PSI distinction between administrative and non-administrative
Figure 3. PSI distinction regarding its relevance
Figure 4. PSI distinction according to its anonymity
Figure 5. Key facts and headline-worthy figures (TechAmerica Foundation)
Figure 6. Areas of improvement through Big data usage in Public Sector
Figure 7. Social media data is not always reliable, but sometimes it is (Cottingham, 2010)
Figure 8. How aware is your Organization of Big Data business opportunities?
Figure 9. In your opinion, what benefits for your organization will the use of Big Data have?
Figure 10. What data do you think would be valuable to collect for your Big Data strategy?
Figure 11. Areas of improvement of public services in the U.S. (TechAmerica Foundation)
Figure 12. What are the most important key challenges you would face for adopting Big Data?
Figure 13: Fragmentation in the telecom market
Figure 14: Telecom landscape first approach for identification of actors
Figure 15: eTOM-based identification of players
Figure 16: eTOM SID model
Figure 17: Detailed data classification in eTOM SID model
Figure 18: Level of data complexity vs. business value
Figure 19: Most benefitted company departments according to executives (different sectors)
Figure 20: Infrastructure readiness for main Big Data technical areas according to executives (different sectors) - Source Economist Intelligence Unit survey, March
Figure 21: What do you think is the biggest opportunity that Big Data presents operators with? - Source European Communications Magazine survey, March
Figure 22: Is Big Data a strategic goal in your organisation currently? - Source European Communications Magazine survey, March
Figure 23: How well do you think you collate and analyse the data in your possession currently? - Source European Communications Magazine survey, March
Figure 24: How aware is your company of Big Data business opportunities? BIG survey
Figure 25: What are the most important key challenges you would face for adopting Big Data? BIG survey
Figure 26: In your opinion, which areas in your company would benefit the most from Big Data? BIG survey
Figure 27: Which of this data are you already collecting and which one do you plan to collect? BIG survey
Figure 28: How much does your company intend to invest in Big Data R&D in ? BIG survey
Figure 29: Do you believe that Big Data should be a key strategic priority for operators? - Source European Communications Magazine survey, March
Figure 30: Big Data and eTOM
Figure 31: Using Big Data to deliver on Defined Business Objectives (survey)
Figure 32: What do you think is the biggest barrier for operators executing Big Data Strategy? - Source European Communications Magazine survey, March
Figure 33 Illustration of a sustainable connected city stakeholders (Rusitschka, 2010) for energy-efficient multimodal transportation in the last mile
Figure 34 - Big data in the energy sector needs generalists and specialists that grasp the intersection of Energy, Informatics, and Data Science
Figure 35 Gartner Magic Quadrant for Meter Data Management Software (Sumic, 2013)
Figure 36 Through deregulation and business innovation energy data stakeholder group becomes larger which requires sophisticated data sharing policies or completely new technologies. Source: German BMWi Project E-DeMa
Figure 37 - The interconnected pan-European power network is a complex system that could profit greatly through real-time on-demand situational awareness using big data

Index of Tables

Table 1. PSI stakeholders
Table 2. General Big Data Requirements (Telecom sector)
Table 3. Big Data Requirements for the Marketing, Product and Customer domain (Telecom sector)
Table 4. Big Data Requirements for the Service domain (Telecom sector)
Table 5. Big Data Requirements for the Resource domain (Telecom sector)
Table 6. Big Data Requirements for the Supplier/partner domain (Telecom sector)
Table 7. Big Data Requirements for the Social Media (Telecom sector)

1. Health

1.1. Introduction

Definition of Big Data in Healthcare Sector

What is Big Health Data?

In this report, we use the term "big health data (technology)" to establish a holistic and broader concept whereby clinical, financial and administrative data as well as patient behavioural data, population data, medical device data, and any other related health data are combined and used for retrospective, real-time and predictive analysis. In this way, big health data technologies help to take existing healthcare Business Intelligence (BI), Health Data Analytics, Clinical Decision Support (CDS) as well as health data management applications to the next level by providing means for the efficient handling and analysis of complex and large healthcare data by relying on

• data integration (multiple, heterogeneous data sources instead of one single data source)
• real-time analysis (instead of benchmarking along predefined key performance indicators (KPIs))
• predictive analysis (instead of retrospective analysis)

Until now, the label "big data" has been used less frequently in healthcare than in other industrial domains. Today, similar technology capabilities and the associated functional opportunities are also referred to as Advanced Health Analytics (e.g. see (Frost & Sullivan, 2012a)).

What are the Characteristics of (Big) Health Data?

Why is health data a form of big data? Not only because of its sheer volume but also because of its complexity, diversity and timeliness. Thus, the "bigness" of health data can be characterized by the well-known Vs and their associated categories:

• Variety: Today's business intelligence and health data analytics applications mainly rely on structured data (very rarely on unstructured data), mostly from a single, internal data source. In future, big health data technologies will establish the basis to aggregate and analyze internal as well as external heterogeneous data that are integrated from multiple data sources.
• Volume: Talking about volume, we need to distinguish structured and unstructured data:
  o Large volumes of structured health data are already present today, for instance if all related data sources of a network of healthcare providers are integrated: in the US, the volume of data of integrated delivery networks (IDNs) can easily exceed one petabyte. Because the integration of health data is less advanced in Europe than in the US, the volume of health data is currently not regarded as an urgent issue.
  o There exist various types of unstructured health data that encompass valuable content for gaining more insights about healthcare-related questions and concerns, such as biometric data, genomic data, text data from clinical charts, and medical images. Information extraction technologies that allow transforming unstructured health data into semantic-based structured formats are the focus of many research initiatives (see for instance (Seifert et al., 2009) or (Meystre et al., 2008)). With the availability of mature information extraction technology for the healthcare sector, the volume of unstructured data will dominate the overall data volume requirements.
• Type of analytics: Today's business intelligence health data applications rely mostly on ex post focused KPIs. Future big health data applications will rely on data integration, complex statistical algorithms, event-based, real-time algorithms and advanced analytics, such as prediction and device.
• (Business) Value: Value addresses the challenge of generating business value out of the health data. One needs to identify the data sources and analytics algorithms that can be associated with a compelling business case that brings value to the involved stakeholders.

Source: Stakeholder Interviews, (Frost & Sullivan, 2012a) and (Groves et al., 2013)

Benefits, Advantages & Impact

Healthcare is a large and important segment of the overall economy that faces tremendous productivity challenges. In particular, there is a clear need for cost efficiency, for improved quality of care, and for broader coverage of healthcare services.

Benefits and Advantages

Big health data technologies can help to address those needs: they can be used to aggregate and analyze data from disparate sources in order to provide insights and guidance in the healthcare process that rely on a more complete and comprehensive view of individual patients and patient populations. For instance, the following (and many other) benefits and advantages can be realized:

• Improved efficiency of care: Clinical, financial and administrative data can be combined to monitor health outcomes in relation to the utilization of resources, such as medications, treatments, etc. The performance of physicians can be measured and compared against peers and other institutions. Health data of patient populations can be mined for clinical research questions.
Through detailed information reporting applications, health provider organizations can improve their operational processes. Increased transparency about the effectiveness of clinical processes helps to improve the efficiency of care settings.

Improved quality of care: Users, such as clinicians and physicians, can access key knowledge that is needed for effective and informed decision making. High-risk patients and patient populations can be identified and subsequently benefit from proactive care or lifestyle changes. By aggregating patient and population data in uniform and multi-dimensional views, valuable insights about symptoms and disease patterns can be provided. Researchers can mine data to identify the most effective treatment for particular conditions.

Source: Stakeholder Interviews, (Frost & Sullivan, 2012a), (McKinsey & Company, 2011) and (Groves et al., 2013)

Clinical Impact

The healthcare domain requires improved efficiency as well as improved quality of care. Today, efficiency of care and quality of care are two opposing requirements: the more information is available about a patient's health history and status, the more individualized the treatment decision can be, which automatically leads to an increased quality of care. However, without big data technology, i.e. means for automatically analyzing large amounts of heterogeneous health data, improved quality of health care services will always lead to increased cost of care, as individualized treatment paths cannot be standardized and thus are likely to become very labour- and cost-intensive.

In order to address this shortcoming, the various dimensions of health data, such as
- the clinical data describing the health status and history of patients,
- the administrative and clinical process data,
- the knowledge about diseases as well as related (analyzed) population data, and
- the knowledge about changes in time,
need to be incorporated in the automated health data analysis in order to make the analysis effective. If the data analysis is restricted to only one dimension of data, for example the administrative and financial data, it becomes possible to improve the already established management and reimbursement processes, but it will not be possible to identify new standards for individualized treatments. Hence, the highest clinical impact of big data approaches for the healthcare domain can be achieved if data from the four dimensions are aggregated, compared and related. In doing so, big data technologies will help to produce new insights enabling more and more personalized treatments.

Today it is common clinical practice to treat patients as some sort of average. Clinicians diagnose a disease or a condition and suggest a treatment by relying on knowledge, such as clinical studies, that describes findings that work for the majority of people.
The conventional double-blind studies, which are conducted to prove the effectiveness and safety of treatments, usually rely on sample data sets representing patients with similar characteristics and only rarely factor in the differences between patients. However, with big data analytics, it becomes possible to segment the patients into groups and subsequently determine the differences between patient groups. Instead of asking the question "Is the treatment effective?" it becomes possible to answer the question "For which patients is this treatment effective?". This shift from average-based towards individualized healthcare bears the potential to significantly improve the overall quality of care.

Source: Stakeholder Interviews and (O'Reilly et al., 2012)

Financial Impact

Due to the evolving nature of big health data technological capabilities as well as of the business value provided, it is difficult to clearly define the big health data market and, consequently, its financial impact. Within our discussions and interviews about potential big health data applications, it became clear that for the majority of the sketched applications the technical foundation was already available; however, the business case was still missing or unclear.

In general, big data applications are not isolated applications but cover the complete value chain of the healthcare setting. As a result, several stakeholders (such as insurances, hospital operators, clinicians, etc.) with opposing interests are usually involved. The successful implementation of big health data applications thus relies on a clear and convincing business case, on significant changes in the overall value chain and the interplay of stakeholders, as well as on changes regarding the underlying incentives.

Because of this unclear market situation as well as fluctuating concepts and definitions of big data products and services, quantitative revenue forecasts are difficult to provide and (to the best of our knowledge) not available. However, some impressive estimates about the financial impact of big data applications in the US healthcare sector are available and should be mentioned in this context:

According to the McKinsey study (McKinsey & Company, 2011), big data applications have the potential to generate significant financial value in the US healthcare sector. Their financial calculations are based on the assumption that the existing best practices of big data applications are emulated and implemented. In addition, it is assumed that large and comprehensive datasets will be analyzed in order to improve the effectiveness and efficiency of health care as an entire system. Accordingly, the discussed applications range from clinical operations, to administrative and financial processes, to knowledge discovery applications in the Research and Development (R&D) domain, as well as public and government-driven applications for analyzing and improving population health. Moreover, the calculation assumes that the required IT and dataset investments, analytical capabilities, privacy protection, and appropriate economic incentives are in place. With all those premises in place, McKinsey estimates that in about ten years' time there is an opportunity to capture more than $300 billion [1] per year in new value, with two-thirds of that in the form of reductions to national health care expenditures.

[1] With one billion being one thousand million.

Source: (Frost & Sullivan, 2012a) and (McKinsey & Company, 2011)

1.2. Industrial Background

Characteristics of the European Healthcare Industry

Changing Patient Demographics

European citizens live longer, and quite often not in good health. Thus, the percentage of the European population older than 65 years is steadily growing. Whereas in the year 2006 only 29.7 percent of the population was older than 65 years, the forecast for the year 2014 expects 31.9 percent of the population to be older than 65 years. The combination of more elderly people and changes in lifestyle, such as smoking, physical inactivity, alcohol consumption, etc., is expected to lead to an increased risk of chronic diseases. Although the average European citizen is expected to live longer, the time European citizens spend in good health does not increase. In other words, not only the absolute number of years of life will increase, but also the number of unproductive or non-healthy years, which in turn will significantly increase the demand for health care services.

Source: (Frost & Sullivan, 2012b)

Increasing Healthcare Costs

Healthcare costs in Europe are increasing significantly. For instance, the total healthcare expenditure (public and private) grew from $1,511 million in 2006 to $2,359 million in 2011, which yields a five-year Compound Annual Growth Rate (CAGR) of 9.3%. Similarly, the GDP (Gross Domestic Product) increased from 16 percent in 2006 to 25 percent in 2011, which adds up to a five-year CAGR of 9.5%. Several reasons are causing this significant rise in healthcare expenditure:
- With the increase in the aged population, the share of tax-paying citizens shrinks and, thus, the revenue of the government decreases.

- There have been increased investments by governments and private companies to develop new drugs, techniques, equipment, and services.
- Due to the increase in chronic diseases, more people require life-long treatment and more hospital stays become necessary.
- Today's almost universal coverage of health services in Europe is accompanied by unequal social contributions, which in turn affects the overall financial situation.

Source: (Frost & Sullivan, 2012b)

Technology Intensity

The healthcare industry is technology-intensive as well as technology-driven. Between 2007 and 2010, the European healthcare technology industry grew by more than 10%. In addition, the pace of healthcare technology development is quite fast: on average, it takes 18 to 24 months until an improved version of a product reaches the market. The number of filed patents also confirms the high level of technology intensity. In roughly ten years, more precisely between 2001 and 2012, the number of patents filed by the European healthcare technology industry has doubled.

Source: (Frost & Sullivan, 2012b)

Regulated Market

The healthcare industry is a regulated market. Legislation and policies have a strong influence on the overall interplay of healthcare industry stakeholders, and thus also on the innovative power of the industry as well as on the quality and effectiveness of care settings. It is important to note that legislation differs from country to country. European politics has made some efforts to address the current challenges in the healthcare industry, such as the increasingly aging population, changes in disease patterns with changing lifecycles, the increased healthcare costs, the global health challenges and implications, or the increasing inequality in receiving healthcare.
The focus of future healthcare policy is to address (amongst others) the following aspects:
- long-term care
- strengthening public health structure, performance, and efficiency
- disease prevention and social inclusion in care delivery

Source: (Frost & Sullivan, 2011) and (Frost & Sullivan, 2012b)

Market Impact and Competition

Market Impact

The market for big data technology in the healthcare domain is at an early stage. To the best of our knowledge, concrete financial numbers on the market impact are not available. In order to provide some insights into the market potential of big data technology in the healthcare domain, we reference the Frost & Sullivan study on U.S. hospital health data analytics markets (Frost & Sullivan, 2012a), which provides market numbers and forecasts for a related and overlapping market segment, i.e. the U.S. hospital health data analytics market:

The study concludes that the U.S. hospital health data analytics market is an emerging market at an early stage that has reached a saturation of 14% with an increasing trend.

In addition, the study highlights that the adoption rate of health data analytics relies on the availability, and thus the adoption rate, of Electronic Health Records (EHRs). Due to the seed funding for EHR technology that was provided by the US health reform, the adoption of hospital EHR technology is currently growing significantly and is expected to grow further. While in 2011 only about every third US hospital (35%) had implemented some kind of EHR technology, it is expected that in 2016 nearly every US hospital (95%) will use EHR technology. This represents an increase of 171 percent and a CAGR of roughly 22 percent.

The adoption rate of health data analytics is strongly influenced by the adoption of EHR technology for two reasons:
o First, as any advanced health data analytics relies on integrated data sets, hospitals focus first on the EHR implementation.
o Second, due to the required investments for EHR implementations, hospitals often need to postpone any other investments, such as the implementation of data analytics technologies.

For this reason, the total adoption of health data analytics will in the coming years lag behind that of EHR implementations. However, the increase will be even more significant. While in 2011 only one in ten US hospitals had implemented health data analytics solutions, in 2016 every second hospital will have implemented some form of health data analytics. This represents an increase of 400 percent and a CAGR of 37.9 percent.

Competition

The market for big health data technology and solutions is highly competitive. For instance, a recent McKinsey evaluation of the big data marketplace revealed that since 2010 more than 200 businesses offering innovative approaches for health data analytics and usage have emerged. Similar observations are made by a Frost & Sullivan study that already today identifies more than 100 competitors in the domain of hospital health data analytics.
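
The growth figures quoted for EHR and health data analytics adoption in the Market Impact discussion can be re-derived from the start and end adoption rates. The following Python sketch is only an illustration under stated assumptions: the function names `percent_increase` and `cagr` are hypothetical helpers introduced here, the adoption rates are the rounded values given in the text, and the five-year horizon (2011 to 2016) is taken from the text. It applies the standard compound annual growth rate formula, (end / start)^(1 / years) - 1.

```python
def percent_increase(start: float, end: float) -> float:
    """Relative increase from start to end, expressed in percent."""
    return (end - start) / start * 100.0


def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.10 means 10% per year)."""
    return (end / start) ** (1.0 / years) - 1.0


# Adoption rates quoted in the text (share of US hospitals, in percent).
# These are the rounded figures from the report, not raw survey data.
ehr_2011, ehr_2016 = 35.0, 95.0
analytics_2011, analytics_2016 = 10.0, 50.0
years = 2016 - 2011  # five-year horizon, as stated in the text

for label, start, end in [
    ("EHR adoption", ehr_2011, ehr_2016),
    ("Analytics adoption", analytics_2011, analytics_2016),
]:
    increase = percent_increase(start, end)
    growth = cagr(start, end, years)
    print(f"{label}: +{increase:.0f}% overall, CAGR {growth:.1%}")
```

With the rounded inputs above, the sketch yields an overall increase of about 171 percent and a CAGR of about 22 percent for EHR adoption, and an overall increase of 400 percent with a CAGR just under 38 percent for analytics adoption, which is in line with the values cited in the study.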
Source: (Frost & Sullivan, 2012a) and (Groves et al., 2013)

Stakeholders: Roles and Interest

Currently, there is strong competition between the stakeholders of the health care industry. It is a competition for resources, and the resources are limited. Each stakeholder is focused on his or her own financial interests, which often leads to sub-optimal treatment decisions. In consequence, the patient is currently the one who suffers most. In the following, we sketch the various roles and interests, business incentives and market positions of the stakeholders of the healthcare industry:

Patients have an interest in affordable, high-quality and broad coverage of care. As of today, only very limited data about the patient's health conditions is available, and patients have only very limited opportunities to actively engage in the care process.

Hospital operators are trying to optimize their income from medical treatments, i.e. they have a strong interest in improved efficiency of care, such as automated accounting routines, improved processes or improved utilization of resources.

Clinicians and physicians are interested in more automated and less labour-intensive routine processes, such as coding tasks, in order to have more time available for and with the patient.

In addition, clinicians and physicians are interested in accessing aggregated, analyzed and concisely presented health data that enable informed decision making and high-quality treatment decisions.

Payors, such as governmental/commercial payors or healthcare insurances: As of today, the current reimbursement systems manage fee-for-service payments using simple IT-negotiation and data exchange processes between payors and healthcare providers and do not rely on data analytics applications. As payors decide which health services (i.e. which treatment, which diagnosis or which preventive test) will be covered, their position and influence regarding the adoption of innovative treatments and practices is quite powerful. However, as today only limited and fragmented data about the effectiveness and value of health services is available, the reasons for treatment coverage often remain unclear and sometimes seem arbitrary. As of today, prevention-related health services are not refunded by insurances (apart from some minor exceptions) and are mainly paid for by the patients themselves. The reason might be a short-term focused Return on Investment (ROI) calculation, as any implementation of preventive clinical care settings requires large and long-term investments before positive effects can be expected. However, assuming that the healthcare industry will continuously adopt the new paradigm of value-based healthcare, new models of reimbursement fostering outcome-orientation will emerge. In the US, for instance, the health care reform (PPACA) fosters the transition from fee-for-service towards quality-based reimbursement models. However, to implement quality-based reimbursement models, payors will require a more holistic view of the healthcare process and the overall outcome of healthcare services in order to gain insights about the efficiency and effectiveness of treatments by using analytics tools on comprehensive patient data sets.
Regulation & Government aim at a reduction of healthcare expenditure and, at the same time, an increase in the quality of care. The implementation of value-based healthcare services relying on big health data analytics applications is a promising step towards achieving this goal. Today, the position and interventions of regulatory and government bodies in Europe differ from country to country. However, any new regulations and laws, such as the development of appropriate system incentives or legal foundations for data exchange and sharing, that are driven by a deep understanding of the interests of the individual players have the potential to significantly foster the implementation as well as the adoption of big data-based healthcare applications.

Pharmaceuticals [1], Life Science [2], Biotechnology [3] and Clinical Research: The discovery of new knowledge is the main interest and focus here. As of today, the mentioned domains are mainly unconnected and run their data analytics applications on single data sources. By integrating heterogeneous and distributed data sources, the impact of data analytics solutions is expected to increase significantly in the future.

Medical Product Providers [4] are interested in accessing and analysing clinical data in order to learn about their own product performance in comparison to competitors' products, so as to increase revenue and/or improve their own market position.

[1] Companies engaged in the research, development, or production of pharmaceuticals.
[2] Companies enabling the drug discovery, development, and production continuum by providing analytical tools, instruments, consumables and supplies, clinical trial services, and contract research services.
[3] Companies primarily engaged in the research, development, manufacture, and/or marketing of products based on genetic analysis and genetic engineering.
[4] Companies providing healthcare equipment and devices, such as medical instruments, imaging scanners, diagnostic equipment, surgery equipment, etc., as well as companies providing information technology services primarily to healthcare providers, such as information systems (HIS, CIS, RIS, etc.), data exchange offerings, data processing and integration software, data analytics offerings, etc.

In an ideal world, each healthcare industry stakeholder would aim to establish the basis for preventive and pro-active care by means of comprehensive and integrated health data analytics. However, as of today, the world is far from ideal. Transforming the current healthcare system into a preventive, pro-active, and value-based one requires the seamless exchange and sharing of health data. This in turn requires effective cooperation between stakeholders. However, today the healthcare setting is mainly determined by incentives that hinder cooperation. To foster the implementation and adoption of comprehensive big data applications in the healthcare sector, the underlying incentives and regulations defining the conditions and constraints under which the various stakeholders interact and cooperate need to be changed.

Source: Stakeholder Interviews and (Porter and Olmsted Teisberg, 2006)

Value-based Healthcare Delivery: a new paradigm for effective collaboration

To implement more effective healthcare delivery that limits healthcare expenditure and at the same time helps to increase the quality of care settings, value-based health care is becoming the focus of many healthcare reforms, such as the US healthcare reform. The overall idea is simple and straightforward: in order to focus on the value of healthcare, in terms of improved outcomes, one aims to make available the health data that allows clinicians to identify best practices, which in future will provide guidance for the optimal utilization of resources for achieving the best results. However, the creation of a value-based healthcare system will not be possible through incremental improvements, but will require a fundamental restructuring of the overall health care delivery. As of today, the healthcare industry is characterized by a number of stakeholders that are competing for a restricted number of resources.
Usually, competition helps to improve the overall performance of an industrial setting. However, this is not the case for the healthcare industry, as the competition is not aligned with value [1]. In other words, neither the patient's success nor the treatment performance is related to the financial incentives or success of the system's participants. For instance, a healthcare provider can receive high financial reimbursement although the treatment performance was sub-optimal. In order to avoid sub-optimal treatment performance and to establish value-based healthcare delivery, positive-sum competition on value needs to be realized. Instead of accounting for cost containment and for the accomplished volume or number of treatments, value-based reimbursement models will consider the long-term value for the patient.

[1] With value being defined as the patient health outcome per Euro spent.

A second key principle of value-based healthcare is the maintenance of a healthy patient population, which relies on the well-known fact that better health is inherently less expensive than poor health. Thus, any diagnostic interventions that help to sustain or improve the patients' health status, such as prevention, early diagnosis, right diagnosis, fewer complications and mistakes, early and timely treatments, etc., are important mechanisms in value-based healthcare settings. In this context, big data technology will play an important role in establishing means to track and analyze the treatment performance of patient populations.

Source: (Porter and Olmsted Teisberg, 2006) and (Soderlund et al., 2012)

Available Data Sources

The health care system has several major pools of health data which are held by different stakeholders/parties:

- Clinical data, which is owned by the providers (such as hospitals, care centres, physicians, etc.) and encompasses any information stored within classical hospital information systems or EHRs, such as medical records, medical images, lab results, genetic data, etc.
- Claims, cost and administrative data, which is owned by the providers and the payors and encompasses any data sets relevant for reimbursement issues, such as utilization of care, cost estimates, claims, etc.
- Pharmaceutical and R&D data, which is owned by pharmaceutical companies, research labs/academia and government and encompasses clinical trials, clinical studies, population and disease data, etc.
- Patient behaviour and sentiment data, which is owned by consumers or monitoring device producers and encompasses any information related to patient behaviours and preferences.
- Health data on the web: Websites such as PatientsLikeMe are becoming more and more popular. By voluntarily sharing data about rare diseases or remarkable experiences with common diseases, their communities and users are generating large sets of health data with valuable content.

As each data pool is held by different stakeholders/parties, the data in the health domain is highly fragmented. However, the integration of the various heterogeneous data sets is an important prerequisite of big health data applications and requires the effective involvement and interplay of the various stakeholders. Therefore, as already mentioned, adequate system incentives that support the seamless sharing and exchange of health data are needed.

Source: Stakeholder Interviews, (Frost & Sullivan, 2012a), (McKinsey & Company, 2011)

Drivers and Constraints

Constraints

Digitalization of health data: Until today, only a small percentage of health-related data is digitally documented and stored. There is a substantial opportunity to create value if more data sources can be digitized with high quality and made available as input for analytics solutions.
Lack of standardized health data (e.g. EHR, common models / ontologies): To establish the basis for health analytics, health data across hospitals and patients need to be captured in a unified way. This can be accomplished by current technologies, such as Extract, Transform and Load (ETL), Health Information Exchange (HIE), EHR, common models or ontologies.

Data silos: As of today, healthcare data is often stored in distributed data silos, which makes data analytics cumbersome and unstable. Integrated data storage solutions, such as data warehouses (DWHs), need to be available.

Organizational silos: Due to missing incentives, cooperation across different organizations, and sometimes even between departments within one organization, is currently rare.

Data security and privacy: As legal frameworks defining data access, security and privacy issues and strategies are missing today, the sharing and exchange of data is hindered. Simply because the involved parties lack procedures for sharing and communicating relevant findings, important data and information often remain siloed within one department, group or organization.

High investments: The majority of big data applications in the healthcare sector rely on the availability of large-scale, high-quality and longitudinal health care data. Collecting and maintaining such comprehensive data not only requires high investments; when dealing with longitudinal data, it also usually takes some years until the data sets are comprehensive enough to produce good analytics results. As such high and long-term investments can hardly be covered by one single party, the joint engagement of most stakeholders, including the government, is needed.

Missing business cases and unclear business models: Any innovative technology that is not aligned with a concrete business case, including the associated responsibilities, is likely to fail. This is also true for big data solutions. Hence, the successful implementation of big data solutions requires transparency about the following three questions: a) who is paying for the solution? b) who is benefiting from the solution? and c) who is driving the solution? For instance, the implementation of data analytics solutions using clinical data requires high investments and resources to collect and store patient data, for example by means of an EHR solution. Although it seems quite obvious how the involved stakeholders could benefit from the aggregated data sets, it remains unclear whether the stakeholders would be willing to pay for or drive such an implementation.

Drivers

Increased volume of electronic health data: With the increasing adoption of EHR technology (which is already the case in the US) and the technological progress in the areas of next-generation sequencing and medical image segmentation, more and more health data will be available.

Need for improved operational efficiency: To address the greater patient volumes (aging population) and to reduce the currently very high healthcare expenses, transparency about operational efficiency is needed.

Trend towards value-based healthcare delivery: Value-based healthcare relies on the alignment of treatment and financial success.
In order to gain insights about the correlation between the effectiveness and cost of treatments, data analytics solutions on integrated, heterogeneous, complex and large sets of healthcare data are in demand.

US legislation: The US healthcare reform, also known as Obamacare, fosters the implementation of EHR technologies as well as health data analytics applications, which in turn has a significant impact on the international market for big health data applications.

Trend towards increased patient engagement: First applications, such as PatientsLikeMe, demonstrate the willingness of patients to actively engage in the healthcare process.

Trend towards new system incentives: Current system incentives reward a high number instead of a high quality of treatments. Although it is obvious that nobody wants to pay for treatments that are ineffective, this is still the case in many medical systems. In order to avoid reimbursement of low-quality treatments, the incentives of the medical systems need to be aligned with outcomes. Several initiatives, such as Accountable Care Organizations (ACOs) or Diagnosis-Related Groups (DRGs), have been implemented in order to reward quality instead of quantity of treatments.

Source: Stakeholder Interviews, (Frost & Sullivan, 2012a) and (O'Reilly et al., 2012)

Challenges that need to be addressed

To summarize the above discussion, the following challenges need to be addressed in order to establish a basis for the successful implementation of big data health applications:
- Data Digitalization and Acquisition, i.e. the goal is to get health data into a form that can be used as input for analytics solutions.

- Data Integration and Storage, i.e. the goal is to store health data in a more or less standard form that can be shared efficiently and moved easily and quickly from one location to another.
- Data Security and Privacy, i.e. the goal is to establish legal procedures that allow the sharing and communication of data and findings.
- System Incentives, i.e. the goal is to establish system incentives and reimbursement models that foster value-based healthcare.
- Data Usage and Business Cases, i.e. the goal is to explore the so far mainly undiscovered and unclaimed potential business value of health data analytics.

Role of Regulation and Legislation

Any innovation and change in medical systems, such as the implementation of big data-based technologies, not only affects but also relies on the involvement of many stakeholders, such as patients, health professionals, insurances, government bodies, etc. Regulation and legislation can help to stimulate the underlying incentives for collaboration as well as provide seed funding for required enabling technologies, such as means for data exchange and integration, for instance EHR, HIE or DWH solutions. Past efforts in implementing new healthcare reforms showed that radical reforms that neither incorporate nor reflect the views and interests of the involved stakeholders are very unlikely to be successful. However, reforms that are based on (many) incremental steps which are aligned with the interests of the various stakeholders are more likely to be effective in transforming the health system in the long run.

Already today, several government-sponsored big data initiatives within the healthcare sector help to increase the transparency of treatments and thus the value for patients.
For instance:
- The Swedish government increased its investments in expanding Sweden's network of disease registries, which consolidate very valuable health data for subsequent analysis, from $10 million to $45 million per year by 2013 (Soderlund et al., 2012).
- The Italian Medicines Agency collects and analyzes clinical data for evaluating the effectiveness of new, expensive drugs in order to re-evaluate prices as well as market conditions (Groves et al., 2013).
- Obamacare, the unofficial name of the US health care reform which was signed by Obama in 2010, aims to increase the efficiency, quality and coverage of the current US healthcare system through various initiatives and programs. For instance,
o the reform provides incentives to implement new payment and delivery models. One prominent example is the Accountable Care Organization (ACO), an organization with a specific legal structure that provides the basis for risk sharing between health provider and payor. Thus, besides focusing on high-quality care coordination, ACOs aim to achieve shared savings through the use of bundled payments reflecting the value of treatment episodes. In order to be able to track the value and performance of treatments, data analytics technologies are needed.
o In addition, the reform provides seed funding for replacing paper charts with EHR technology. Although the implementation of EHR technology was optional, it became quite attractive for most health providers. It is expected that the adoption of EHR technology in US hospitals will increase from a 35 percent adoption rate in 2011 to a 95 percent adoption rate in 2016.

This initial list of examples already demonstrates the important role of legislation and regulation in fostering the successful implementation and adoption of big data technology applications.

Source: Stakeholder Interviews, (Frost & Sullivan, 2011), (Frost & Sullivan, 2012a) and (Groves et al., 2013)

1.3. Big Data Application Scenarios

In the following, we describe a selection of the big data application scenarios that we have discussed with our interview partners so far. The detailed description of the scenarios is intended to give the reader a good overview of the business opportunities as well as the accompanying challenges of big data technologies in the health care industry. However, it is important to mention that the selection of application scenarios is neither comprehensive nor complete.

1.3.1 Comparative Effectiveness Research

Description: The goal of this application scenario is to compare the clinical and financial effectiveness of interventions in order to increase the efficiency and quality of clinical care services. Large datasets encompassing clinical data (information about patient characteristics), financial data (cost data) and administrative data (treatments and services performed) are critically analyzed in order to identify the most clinically effective and most cost-effective treatments that work best for particular patients. Several stakeholders of the healthcare industry can benefit from such a scenario:
- Clinicians could receive recommendations about the most clinically effective treatment alternative for a particular patient.
- Hospital operators could receive recommendations about the most financially effective treatment alternative for a particular patient.
- Payors could use the discovered knowledge about the most effective treatment to align their reimbursement strategy.
- Patients could benefit by receiving the most effective treatment in accordance with their particular health conditions.
This application scenario is based on two steps: a) the identification of the most efficient treatment (Knowledge Discovery) and b) the improvement of clinical processes (Knowledge Usage). Currently, the knowledge discovery step is covered by publicly funded and/or governmental research agencies. To what extent and in which ways the usage of the discovered knowledge is aligned with new business models and value chains needs to be investigated in further detail. Comparative effectiveness research can be accomplished without big data technologies (mainly through manual data collection) or with them. However, the analysis of large and complex data sets that allows identifying not only hypothesis-driven but also data-driven patterns requires big data technologies.

Example Use Cases: Several publicly funded and/or governmental research agencies, such as the National Institute for Health and Care Excellence (UK), the Institute for Quality and Efficiency in Healthcare (Germany), the Common Drug Review (Canada) or Australia's Pharmaceutical Benefits Scheme, have started to run Comparative Effectiveness Research programs aiming to discover knowledge about the effectiveness of treatments. However, to what extent the research of these agencies relies on big data technologies will be the focus of our future investigations and interviews.

User Value:
- User Impact: high impact, in particular for chronic diseases (as chronic diseases are associated with high costs for all involved parties).

- Maturity: some first implementations for severe diseases exist; however, today those research studies are very costly because they are realized via prospective studies.
- Financial Impact: varies, depending on the status quo of treatment costs for the various diseases.

Prerequisites:
- Data Digitalization and Data Integration of health data from various domains, such as clinical data, financial data, administrative data, disease data, etc. This data needs to be of good quality and should have high coverage.
- Avoidance of biased data sets: randomly selected data sets for clinical studies are very likely to be biased. For instance, on average, elderly patients more often receive the less expensive drug. As the data sets in prospective studies are manually selected, the issue of biased data sets can be addressed there. Retrospective studies relying on big data technologies would need to find solutions to address this issue.

Data Sources: Clinical data, administrative data and financial data: large clinical data sets encompassing information about patient characteristics and about the cost and outcome of treatments.

Type of Analytics: Advanced Analytics (by critically analyzing comprehensive clinical data sets, predictions about the most clinically and financially effective treatments are made).

Required Big Data Technologies: will be investigated together with the technical working groups.

Sources: Stakeholder Interviews and (McKinsey & Company, 2011)

1.3.2 Clinical Decision Support

Description: Clinical decision support (CDS) applications aim to enhance the efficiency and quality of care operations by assisting clinicians and healthcare professionals in their decision-making process: by enabling context-dependent information access, by providing pre-diagnosis information, or by validating and correcting the data provided.
Thus, those systems support clinicians in informed decision making, which in turn helps to reduce treatment errors and to improve efficiency. By relying on big data technology, future clinical decision support applications will become substantially more intelligent.

Example Use Cases: Pre-diagnosis of medical images, treatment recommendations reflecting existing medical guidelines.

User Value:
- User Impact: very high for some selected scenarios. However, [...] percent of clinical decisions are routine tasks which do not require dedicated CDS.
- Maturity: some implementations exist; however, CDS systems require long development times.
- Financial Impact: no data available; depends on the focus of the CDS.

Prerequisites: Trust and confidence are crucial for CDS systems to be accepted. As clinicians will only rely on CDS systems if it is guaranteed that all relevant data sources are integrated, comprehensive data integration in high data quality is an important prerequisite.

Data Sources: Clinical data.

Type of Analytics: Depending on the type of clinical decision support system, it can rely on a) Basic Analytics (e.g. monitoring, reporting, statistics), b) Mature Analytics (e.g. data mining, machine learning) or c) Advanced Analytics (e.g. prediction, devise).

Required Big Data Technologies: will be investigated together with the technical working groups.

Sources: Stakeholder Interviews and (McKinsey & Company, 2011)

1.3.3 Clinical Operation Intelligence

Description: Clinical Operation Intelligence aims to identify waste in clinical processes in order to optimize them accordingly. By analyzing medical procedures, performance opportunities, such as improved clinical processes or the fine-tuning and adaptation of clinical guidelines, can be realized. Two user groups can benefit from clinical operation intelligence:
- Healthcare professionals gain further insights into the effectiveness of treatment decisions and processes and can adapt their decisions accordingly.
- Patients are informed about the effectiveness of treatments and can select those treatments that offer the best value for them.

As of today, the value and effectiveness of clinical processes is only evaluated in manually conducted clinical studies. By using big data technology, information about the value of a single treatment can be inferred automatically. So far, the underlying model remains unclear.

Example Use Cases: Publishing cost, quality and performance data of various departments or hospitals creates competition that will in turn drive performance improvements.

User Value:
- User Impact: depends on the area of analysis.
- Maturity: as the legal framework regulating the use of data is missing, no implementations exist so far.
- Financial Impact: no data available.

Prerequisites:
- Data Security and Privacy Requirements: As this scenario relies on seamless data access for any involved party, a common legal framework regulating the use of patient data is required.
- Engagement of Clinicians: Any learnings and adaptations regarding clinical guidelines need to be initiated and approved by healthcare professionals in order to be accepted by the clinical community.

Data Sources: Clinical, administrative and financial data and outcome data.

Type of Analytics: Clinical operation intelligence can already be realized by means of basic analytics (e.g. monitoring, reporting and statistics).

Required Big Data Technologies: will be investigated together with the technical working groups.

Sources: Stakeholder Interviews and (McKinsey & Company, 2011) [1]

[1] In their report, McKinsey labels the application scenario "Transparency about medical data".

1.3.4 Secondary Usage of Health Data

Description: We define secondary usage of health data as the aggregation, analysis and concise presentation of clinical, financial, administrative and other related health data in order to discover new valuable knowledge, for instance to identify trends, predict outcomes or influence patient care, drug development or therapy choices.

Example Use Cases: Depending on the type of data analyzed as well as the value and new insights generated, the users as well as the business cases and models will differ.

Example 1: Identification of patients with rare diseases
Big data technology is used for the early detection of patients with rare diseases. The information is valuable a) for pharmaceutical companies, as they can use it to identify future customers who will buy their drugs, b) for hospitals, as they can demonstrate a high level of quality, and c) for government and payors, as the early detection of diseases is usually less expensive than late detection.

Example 2: Patient recruiting and profiling
Big data technology is used to recruit new patients who are suitable for clinical studies. Today, clinical studies, in particular studies investigating rare diseases, often fail because not enough patients are available.

Example 3: Forecast of clinical process values
Comprehensive health data sets are analyzed in order to make forecasts (predictive analysis) regarding relevant clinical benchmarks, such as the expected health care spending in the next year or the forecasted utilization of dedicated resources, such as MR scanners or operation facilities.

Example 4: Health knowledge broker
Health-related data is analyzed to develop commercialization plans or portfolio strategies for third party companies.
For instance, the analysis of utilization and consumption patterns of medications yields valuable insights that can be used to improve the marketing strategy of pharmaceutical companies (see IMS Health).

User Value:
- User Impact: depends on the quality of the data and the relevance of the question answered by the scenario.
- Maturity: the technology is available; however, a successful implementation requires a convincing use case.
- Financial Impact: depends on the use case.

Prerequisites:
- Integrated data in high quality: the value of data analysis depends on the integration of comprehensive and complete data as well as on the quality of the input data.
- Business case: any successful implementation requires a clear business case.
- Privacy and security of data: a common legal framework specifying data access control and policies of data usage.
- Standards, such as the International Classification of Diseases, 10th revision (ICD-10) and Health Level Seven (HL7), are needed to establish common semantics for (re-)used data items.

Data Sources: Clinical, administrative and financial data, pharmaceutical and R&D data, patient behaviour and sentiment data, data from related external knowledge sources.

Type of Analytics: Depending on the implemented business case, either a) basic analytics (e.g. monitoring, reporting, statistics), b) mature analytics (e.g. data mining, machine learning) or c) advanced analytics (e.g. prediction, devise) are used.

Required Big Data Technologies: will be investigated together with the technical working groups.

Sources: Stakeholder Interviews and (PricewaterhouseCoopers, 2009)

1.3.5 Public Health Analytics

Description: Public health analytics applications rely on the comprehensive disease management of chronic (e.g. diabetes, congestive heart failure) or severe (e.g. cancer) diseases, which allows treatment and outcome data to be aggregated and analysed. This data can in turn be used to reduce complications, slow disease progression, improve outcomes, etc. Several stakeholders could benefit from the availability of a broad national and international infrastructure for public health analysis:
- Payors: As of today, payors lack the data infrastructure required to track diagnoses, treatments, outcomes and costs on the patient level and thus are not able to identify best-practice treatments.
- Government: can reduce healthcare costs and improve the quality of care.
- Patients: can access improved treatments according to best-practice know-how.
- Clinicians: can use best-practice recommendations and make informed decisions in the case of rare diseases.

However, the benefits and opportunities of public health analysis rely on high investments to establish and manage the required common standards, the legal framework, the shared IT infrastructure and the associated community and, therefore, depend on the collaborative engagement and involvement of all stakeholders.

Example Use Case: Success Story Sweden. Since 1970, Sweden has established 90 registries that today cover 90% of all Swedish patient data with selected characteristics (some even cover longitudinal data). A recent study showed that Sweden has the best health-care outcomes in Europe at average healthcare costs (9% of GDP).

User Value:
- User Impact: can become very high, but relies on the comprehensiveness and quality of the data collected in the registries.
- Maturity: varies from country to country; for instance, Sweden already has a very high coverage of disease registries today.
- Financial Impact: can be quite impressive. For instance, Sweden reduced the annual growth of its healthcare spending from 4.7% to 4.1%. This represents an estimated cumulative return over ten years of more than $7 billion [1] in reduced direct costs. Those cost savings could be achieved with annual investments of $70 million in disease registries, data analysis and IT infrastructure.

Prerequisites:
- Clinical Engagement, i.e. active engagement by the clinical community, with clear responsibility for data collection and interpretation.
- National Infrastructure, i.e. common standards, a shared IT platform and a common legal framework defining the data privacy and security requirements for tracking diagnoses, treatments and outcomes on the patient level.
- High-Quality Data that is achieved through systematic analysis of the health outcome data of a population of patients.
- System Incentives that rely on the active dissemination and usage of outcome data.

[1] Billion understood as one thousand million.
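The Swedish figures quoted under Financial Impact can be put side by side with a few lines of arithmetic. The sketch below is only a rough plausibility check: it assumes that the annual investment of $70 million stays constant over the same ten-year period as the savings and that no discounting is applied. Neither assumption is stated in the source, and the names used are purely illustrative.

```python
# Figures quoted in the text, expressed in millions of US dollars.
ANNUAL_INVESTMENT_MUSD = 70       # yearly spend on registries, data analysis, IT
CUMULATIVE_SAVINGS_MUSD = 7_000   # "more than $7 billion" in reduced direct costs
YEARS = 10                        # assumption: investment runs for the same ten years


def savings_per_invested_dollar(savings: float, annual_investment: float, years: int) -> float:
    """Undiscounted savings per invested dollar (illustrative only)."""
    total_investment = annual_investment * years
    return savings / total_investment


total_investment_musd = ANNUAL_INVESTMENT_MUSD * YEARS
ratio = savings_per_invested_dollar(CUMULATIVE_SAVINGS_MUSD, ANNUAL_INVESTMENT_MUSD, YEARS)

print(f"Total investment over {YEARS} years: ${total_investment_musd} million")
print(f"Savings per invested dollar: at least {ratio:.0f}")
```

Under these assumptions, the quoted savings are at least ten times the investment. Because the text speaks of "more than" $7 billion, the ratio should be read as a lower bound rather than a precise return figure.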

Data Sources: Clinical, administrative and financial data, outcome data.

Type of Analytics: Mature analytics (e.g. data mining, machine learning) as well as advanced analytics (e.g. prediction, devise).

Required Big Data Technologies: to be investigated together with the technical working groups.

Sources: Stakeholder Interviews, (McKinsey & Company, 2011) and (Soderlund et al., 2012)

1.3.6 Patient Engagement Application

Description: The idea is to establish a platform or patient portal that fosters active patient engagement in the context of patients' health care processes. The patient platform offers smartphone apps and devices to its members to monitor health-related parameters, such as activity, diet, sleep or weight. The underlying assumption is that patients who are able to continuously monitor their health-related data are encouraged to improve their lifestyle as well as their own care conditions. The collected patient data (biometric and lifestyle data) is aligned with the clinical data of all past encounters stored in the EHR record. In addition, the data of related patient populations is compared in order to identify successful interventions or typical patterns that are more likely to lead to successful treatments, health progress, etc. Several stakeholders of the healthcare industry can benefit from this scenario:
- Patients: The increased patient engagement through actively producing and providing health data might improve overall wellness and health conditions.
- Payors/Government: Reduced healthcare costs through preventive care.
- Clinicians: Informed decision making through access to heterogeneous data sources, such as biometric, device or clinical data.

Although the benefits and user values are transparent, the underlying business models and business values are not yet clear.
In order to address the challenge of integrating and analyzing heterogeneous data sources, such as biometric, device and clinical data, mobile patient portals require big data technology.

Example Use Cases: For instance, the development of predictive models that allow evaluating and predicting successful patient behaviour in a particular health program can help to replicate supporting influence factors or to identify reasons why some patients gave up a program, etc.

User Value:
- User Impact: a positive economic impact is expected; concrete numbers are missing.
- Maturity: the required technological ingredients (such as biometric devices, DWH technology or unified data architecture) are available.
- Financial Impact: no numbers available; depends on the patient population and the selection of data and analysis.

Prerequisites:
- Business case: a convincing business case describing the business value and model needs to be investigated.
- Commitment of stakeholders, such as patients, clinicians and payors.

Data Sources: Clinical data, patient behaviour and sentiment data including biometric and lifestyle data.

Type of Analytics: Advanced Analytics (e.g. prediction, devise).

Required Big Data Technologies: will be investigated together with the technical working group.

Sources: Stakeholder Interviews

1.4. Conclusion and Next Steps

This chapter specifies the term big data for the healthcare sector, describes the needs and requirements of the healthcare sector, identifies possible application scenarios and discusses the roles and interests of the involved stakeholders. The results are based on five stakeholder interviews (not yet a representative and balanced selection of interview partners) as well as on intensive research of related studies and publications. To integrate more perspectives into our discussion of the needs and opportunities of big data technology in the healthcare sector, we plan to conduct further interviews with clinical experts, experts from the pharmaceutical sector and the imaging or genetic domain, governmental bodies, and medical product providers. In addition, we will investigate successful implementations of big data applications in the healthcare domain in order to gain transparency about success factors and to learn about roadblocks that need to be removed.

1.5. Abbreviations and acronyms

ACO      Accountable Care Organisation
BI       Business Intelligence
CAGR     Compound Annual Growth Rate
CDS      Clinical Decision Support
CER      Comparative Effectiveness Research
DRG      Diagnosis-Related Groups
DWH      Data Warehouse
EHR      Electronic Health Record
ETL      Extract, Transform and Load
GDP      Gross Domestic Product
HIE      Healthcare Information Exchange
HL7      Health Level Seven
ICD-10   International Classification of Diseases, 10th revision
IDN      Integrated Delivery Network
KPI      Key Performance Indicator
R&D      Research and Development
ROI      Return on Investment

1.6. References

Frost & Sullivan (2012a). U.S. Hospital Health Data Analytics Market.

Frost & Sullivan (2012b). Analysis of Venture Capital Investment Trends in the European Healthcare Industry.

Frost & Sullivan (2011). Impact of Healthcare Reforms on the Medical Technology Industry.

P. Groves, B. Kayyali, D. Knott and S. Van Kuiken (2013). The big data revolution in healthcare. McKinsey & Company.

McKinsey & Company (2011). Big data: The next frontier for innovation, competition, and productivity.

S. Meystre, G. Savova, K. Kipper-Schuler and J. Hurdle (2008). Extracting information from textual documents in the electronic health record: A review of recent research. Yearbook of Medical Informatics.

M. Porter and E. Olmsted Teisberg (2006). Redefining Health Care: Creating Value-Based Competition on Results. Boston: Harvard Business Review Press.

PricewaterhouseCoopers (2009). Transforming healthcare through secondary use of health data.

T. O'Reilly, J. Steele, M. Loukides and C. Hill (2012). Solving the Wanamaker problem for health care. Online.

S. Seifert, A. Barbu, S. Zhou, D. Liu, J. Feulner, M. Huber, M. Suehling, A. Cavallaro and D. Comaniciu (2009). Hierarchical parsing and semantic navigation of full body CT data. In: SPIE Medical Imaging.

N. Soderlund, J. Kent, P. Lawyer and S. Larsson (2012). Progress Toward Value-Based Health Care. Lessons from 12 Countries. The Boston Consulting Group, Inc.

2. Public Sector

2.1. Introduction

The public sector is increasingly aware of the potential value to be gained from Big Data. Governments generate and collect vast quantities of data through their everyday activities, such as managing pensions and allowance payments, tax collection, National Health System patient care, recording traffic data and issuing official documents. BIG takes into account current socio-economic and technological trends, such as boosting productivity in an environment with significant budgetary constraints, the increasing demand for medical and social services, and standardization and interoperability as important requirements for public sector technologies and applications. Some examples of potential benefits are:
- Open Government and Data sharing. The free flow of information from organizations to citizens promotes greater trust and transparency between citizens and government, in line with open data initiatives. Pre-filling of information (based on the "once only" principle) would be another benefit, reducing mistakes and speeding up processing time.
- Sentiment analysis. Information from both traditional and new social media (websites, blogs, Twitter feeds, etc.) can help policy makers to prioritize services and be aware of citizens' real interests and opinions.
- Citizen segmentation and personalization. Segmenting and tailoring government services to individuals can increase effectiveness, efficiency and citizen satisfaction.
- Economic analysis. Correlating multiple sources of data will help government economists to produce more accurate financial forecasts.
- Tax agencies. Automated algorithms that analyse large datasets and integrate structured and unstructured data from social media and other sources will help tax agencies to validate information or flag potential fraud.
- Threat detection and prevention. Track and analyse citizen activities to spot abnormal behavioural patterns.
- Cyber Security.
Collect, organize and analyse vast amounts of data from government computer networks with sensitive data or critical services, to give cyber defenders greater ability to detect and counter malicious attacks.

As regards the scope of the public sector, according to the Green Paper on PSI (European Commission, 1998), in the functional approach the public sector includes those bodies with state authority or public service tasks that are established for the specific purpose of meeting needs in the general interest, not having an industrial or commercial character; that have legal personality; and that are financed, for the most part, by the state, or regional or local authorities, or other bodies governed by public law.

Industrial Background

The evolution of information technologies in the public sector has run in parallel with that in the private sector, often taking advantage of solutions developed for the latter. However, since the massive advent of Internet technologies, many European governments have realized the potential of establishing a new channel of communication with citizens and businesses, available anytime from everywhere. This revealed the existence of internal data silos, since one public department or agency had no integration with the systems of other public bodies that should provide information for a given administrative process. This means that, in many cases, citizens interacting through e-government applications, or even in person in public offices, had to provide information that the public administration already held but that was not available to that specific public service.

In this section we analyse how the available Big Data technologies, together with the current budgetary constraints and other structural factors such as aging populations, may provide an opportunity to boost the sector's productivity and optimization.

Characteristics of the European Public Sector

The European public sector is one of the most developed in terms of the services it provides. Nevertheless, the current budgetary constraints resulting from the financial crisis are pressing European governments to reduce public debt levels. According to OECD statistics, the European public sector accounts for 10 to 30 per cent of GDP expenditure. This situation will have a long-term impact across Europe's public budgets. Besides this, there is a major structural factor, the population pyramid with aging populations, which will lead to an increasing demand for medical and social services in the coming decades. According to (McKinsey Global Institute, 2011), by 2025 nearly 30 per cent of the population in developed countries will be aged 60 or over. Since public services in Europe provide most of these health and social services, European governments will have to optimize this type of expenditure.

This optimization translates into raising public sector productivity, that is, enhancing performance. The public sector is the major employer in advanced economies, but it lacks productivity growth compared to the private sector.
The public sector is thus not taking full advantage of the technological and organizational improvements that the private sector is applying. In addition, as stated by (Bossaert, 2012), according to OECD statistics, over 30% of the public employees of central government in 13 countries will leave during the next 15 years. Moreover, compared to the private sector, the public sector relies on a far older workforce, who will have to work longer in the future.

Stakeholders: Roles and Interest

According to (Correia, 2004), public sector stakeholders can be classified, from the Public Sector Information (PSI) point of view, into two main categories: societal and state stakeholders. While the first comprises citizens and businesses, the second comprises policymakers and administrations. Figure 1 provides a view of the public sector information system, where the different groups of stakeholders involved are seen as entities (people and organisations) with distinctive characteristics and playing different roles.

Figure 1. Public Sector Information Stakeholders in the PSI system (Correia, 2004)

According to the previous classification, the stakeholders listed in Table 1 have been identified as having a potential interest in the development of Big Data in the European public sector:

Category | Group | Stakeholder | Interest in Big Data
Society | Citizens | EU citizens | Citizen organizations which take care of the improvement of public services through the use of ICT, and also those who care about the exploitation of personal data held by Public Administrations. See the Citizens for Europe site for a representative sample of civil organizations concerned about these issues.
Society | Businesses | European SMEs | SME organizations which take care of the improvement of public services through the use of ICT.
Society | Businesses | ICT Industry | ICT companies, alliances or associations dealing with the application of Big Data to the public sector.
State | Policymakers | European Commission | The European Commission has two main bodies which deal with the policies for the use of PSI and ICT. DG Informatics (DIGIT): according to its mission statement, its goal is to enable the Commission to make effective and efficient use of Information and Communication Technologies in order to achieve its organisational and political objectives. DG Communications Networks, Content and Technology (CNECT): according to its mission statement, this DG helps to harness information & communications technologies in order to create jobs and generate economic growth; to provide better goods and services for all; and to build on the greater empowerment which digital technologies can bring in order to create a better world, now and for future generations.
State | Policymakers | Governments of the EU countries | Each government of the EU countries has competences, within the framework of European legislation, regarding the policies on the use and exploitation of PSI. They are responsible for the organization of Public Administration in each country and for the simplification of its systems, procedures and forms. They are therefore responsible for the procedures for the exploitation of PSI.
State | Administrations | Administrations from EU countries and the EU itself | Public sector bodies and agencies responsible for the management of PSI. This refers to all levels of administration (national, regional and local) as well as to public agencies and companies (see the definition of the public sector scope below in section "Available Data Sources", "Nature of the data").

Table 1. PSI stakeholders

Available Data Sources

Nature of the data

First, we should have a clear view of what information is available in the public sector.
Directive 2003/98/EC (The European Parliament and the Council of The European Union, 2003), on the re-use of public sector information, defines PSI as follows: "It covers any representation of acts, facts or information - and any compilation of such acts, facts or information - whatever its medium (written on paper, or stored in electronic form or as a sound, visual or audio-visual recording), held by public bodies. A document held by a public sector body is a document where the public sector body has the right to authorise re-use."

According to (Correia, 2004), concerning the availability of the information produced by those public bodies, and in the absence of specific guidelines, the producing body is free to decide how to make it available: directly to the end-users, by establishing a public/private partnership, or by outsourcing the commercial exploitation of that information to private operators. Directive 2003/98/EC clarifies that activities falling outside the public task will typically include the supply of documents that are produced and charged for exclusively on a commercial basis and in competition with others in the market.

There are several approaches to classifying the nature of the available PSI. The Green Paper on PSI (European Commission, 1998) proposes classifications such as those in Figure 2 and Figure 3.

Figure 2. PSI distinction between administrative and non-administrative

Figure 3. PSI distinction regarding its relevance

Additionally, PSI can be distinguished according to its potential market value and, in some cases, according to its content of personal data (see Figure 4).

Figure 4. PSI distinction according to its anonymity

Most of the data produced by the public sector is textual or numerical, unlike sectors such as health care, which produce large amounts of electronic images. As a result of the e-government initiatives undertaken during the past 15 years, a great part of this data is created in digital form: 90 per cent according to (McKinsey & Company, 2011). Even so, one major problem the public sector faces is the low level of integration of information among public bodies. Some of this is due to cultural heritage, such as the lack of a central identity database in countries like the UK, and some to the fact that every public body operated as a closed organization and therefore organized its information in data silos.

Big Data in the public sector

As of today, there are no broad implementations of Big Data in the public sector, nor has the sector traditionally used data mining technologies as intensively as other industrial sectors have. However, there is growing awareness within the public sector of the potential of Big Data to improve public services in the current financial environment, as described above in section Characteristics of the European Public Sector.

One example of this growing awareness globally is the announcement made by the Obama administration (The White House, 2012) of the Big Data Research and Development Initiative, where "six Federal departments and agencies will announce more than $200 million in new commitments that, together, promise to greatly improve the tools and techniques needed to access, organize, and glean discoveries from huge volumes of digital data". Another example from the U.S. is the survey about Big Data and the public sector released by the TechAmerica Foundation (TechAmerica Foundation) and commissioned by SAP AG. The key facts of this survey are shown in Figure 5.
Some of the results of the survey showed that:
83% of Federal IT officials say Big Data can save 10% or more from the federal budget.
Police departments are now using real-time Big Data tools to develop predictive models about when and where crimes are likely to occur, dramatically reducing the overall crime rate in some areas.
Real-time Big Data enables law enforcement to find patterns in crime data that might have otherwise gone unnoticed. Officials can then station officers

or position other resources in order to prevent and pre-empt crime, instead of simply solving it.

Figure 5. Key facts and headline-worthy figures (TechAmerica Foundation)

Big Data benefits in the public sector

We have grouped the benefits of Big Data in the public sector into three major areas, based on a classification of the types of benefit (effectiveness and efficiency) and ground-breaking features (analytics), as shown in Figure 6 below.

Big Data analytics. This area covers applications that can only be performed through automated algorithms for advanced analytics, which analyse large datasets to solve problems and reveal data-driven insights. Such abilities can be used to detect and recognise patterns or to produce forecasts that would not be possible without such technical means. Some examples of applications in this area are:
Fraud detection (tax, pensions, unemployment benefits, public subsidies to businesses, money laundering). (McKinsey Global Institute, 2011)
Supervision of regulated activities in the private sector (on-line gambling, energy and financial markets).
Sentiment analysis, through the tracking of information from internet content, including social networks. This can help policy makers to prioritize new services or to uncover potential areas of civil unrest. (Oracle, 2012)

Threat detection from external data sources (social networks, media and Internet content) for application in homeland security, crime prevention, national intelligence and cyber security of critical infrastructures. (Oracle, 2012)
Threat detection from internal data sources (government data networks) for application in government cyber security against both internal and external attacks. (Oracle, 2012)
Predictive analytics for planning public services based on forecasts in given scenarios (education, social services for the elderly, public transport, etc.) or for performing analysis and forecasts on fundamental areas of economic activity (e.g. financial, food and raw materials markets). (Yiu, 2012)

Figure 6. Areas of improvement through Big Data usage in the Public Sector

Improvements in effectiveness. This area covers the application of Big Data to provide greater transparency: internally, increasing the productivity of current processes among public bodies, and externally, giving citizens and businesses access to public data. Citizens and businesses can take better decisions, be more effective, and even create new products and services thanks to the information provided. Some examples of applications in this area are:
Data availability through public agencies, making data available across agencies and organizational silos, reducing search times and automating access to data. (McKinsey Global Institute, 2011)
Sharing and transparency of information across public sector organizations, avoiding the problems caused by the lack of a single identity database (as in the UK), or providing solutions that fulfil the once-only principle, so that citizens and businesses are not asked for information that is already available within the public administration. It also facilitates the pre-filling of information on forms for tax declarations, making life easier for taxpayers and avoiding errors. (Yiu, 2012)

Open government and Open Data. Facilitating the free flow of information from public organizations to citizens and businesses promotes greater trust between citizens and government. In addition, more governments are beginning to adopt Open Data by making raw government databases available to the public. This raw data can be re-used and combined with multiple other datasets from different sources to provide new and innovative services to citizens. (McKinsey Global Institute, 2011)

Improvements in efficiency. This area covers the applications that provide better services and continuous improvement based on the personalization of services and on learning from the performance of such services. Some examples of applications in this area are:
Personalization of public services to adapt to citizen needs. This is achieved through the segmentation and tailoring of public services to individuals, which increases both efficiency and citizen satisfaction. One example is a tailored service for unemployed people at the employment agency, providing personalized guidance and even a training plan to adapt their skills to the current needs of the job market. This segmentation can also be used in tax audits to target specific segments of taxpayers that are more prone to commit fraud or whose professional activity is more difficult to control. (McKinsey Global Institute, 2011)
Improving public services through internal analytics, based on the analysis of performance indicators. Exploiting information already available from current processes can help to improve performance and compare it across different geographical units, or even provide information on vendors and service providers, allowing better procurement decisions and therefore saving money.
(Yiu, 2012)

In relation to the use of information from internet content, including social networks, it should be noted that the data that users provide voluntarily, or as a requirement for creating their accounts, is not always reliable, and quite often the extracted data may be clearly biased with respect to reality. It is worth remembering Peter Steiner's classic 1993 cartoon "On the Internet, nobody knows you're a dog" through a revised version.

Figure 7. Social media data is not always reliable, but sometimes it is (Cottingham, 2010)

Not everybody on the internet tells the truth about themselves, therefore social content is not always a reliable source of information. However, through the connections a user of a social

network has, it is possible to determine the actual user profile. As implied in Figure 7, dogs' friends are dogs too.

From the surveys performed during the elaboration of this report, the following benefits were reported for the public sector. In the framework of the BIG project, a survey has been carried out in order to understand the state of the art of Big Data adoption. The survey has been distributed to 8 public administrations, and so far only 4 have answered. On 16 April 2013, the first workshop, "Building Europe's roadmap for Big Data in the Public Sector", was held in Madrid. During this workshop additional questionnaires were distributed. It is planned that the conclusions from this workshop and the surveys collected will be included in the final version of the requirements. We include here and in the following sections some interesting insights this survey has produced so far. The survey can be found in Annex a.

Figure 8. How aware is your Organization of Big Data business opportunities?

According to Figure 8, none of the interviewed organizations is currently using Big Data. One is currently building its scenario to plan the deployment of a Big Data solution, another has medium-term plans to use Big Data, and the others have not defined a strategy yet.

Figure 9. In your opinion, what benefits for your organization will the use of Big Data have?

According to the results presented in Figure 9, internal efficiency and sharing data with third parties are the greatest benefits for public sector organizations.

Figure 10. What data do you think would be valuable to collect for your Big Data strategy?

According to the results in Figure 10, the most valuable information to collect for a Big Data strategy is historical, current, real-time, and external data.

2.2.6 Driver and Constraints

The public sector is positioned to benefit largely from Big Data, as long as the barriers to its use can be overcome. The sector is both transaction and user (citizen) intensive, so it can realise most of the Big Data benefits, in particular those based on the segmentation of citizens and on analytics over large datasets. The sector has low productivity compared to others, so performance improvements can be gained quickly through internal analytics and new applications of Big Data.

According to the survey among U.S. state IT officials released by the TechAmerica Foundation (TechAmerica Foundation), Big Data can help to improve many areas of public sector services (see Figure 11 below) while offering potential savings of 10% of public budgets. These arguments are welcome in times of economic turbulence that put public spending under pressure, and from the operational point of view they support the implementation of these initiatives while ensuring a reasonable ROI.

Figure 11. Areas of improvement of public services in the U.S. (TechAmerica Foundation)

Another potential area of development is where governments act as catalysts in the development of a data ecosystem by opening their own datasets and actively managing their dissemination and use (World Economic Forum, 2012). In this regard, Open Data initiatives are a starting point for boosting a data market that can take advantage of open information (content) and Big Data technologies. Active policies in the area of Open Data can therefore benefit the private sector and, in return, facilitate the growth of this industry in Europe, which is one of the goals of the BIG initiative. In the end this will benefit public budgets through increased tax income from this growing European data industry.

2.2.7 Challenges that need to be addressed

Big Data is not a landscape free of obstacles. Many problems are foreseen in the application of these high-potential technologies in the public sector.

Privacy and security issues. The Big Brother syndrome is easily evoked when one considers large amounts of personal data in the hands of public services that also have a very high technological capacity for processing them. This information can be crossed with other datasets, and the combination may reveal highly sensitive personal and security information, compromising not only individual privacy but also civil security. Individual privacy and public security concerns must be addressed before governments and society actors can be convinced to share data more openly, whether publicly or in a restricted manner with other governments or international entities.

Big Data skills. There is a lot of hype surrounding Big Data, but it is certainly here to stay, and as it is massively adopted in business it will become harder to find skilled Big Data professionals. In the public sector areas where Big Data is most actively pursued, such as research and intelligence agencies, there are currently workers with the skills to manage Big Data projects. However, in a few years' time there will be high demand for Big Data skills across government agencies and in industry in general. Public agencies could go a fair distance with the skills they already have, but then they'll need to make sure those skills advance (1105 Government Information Group).

Figure 12. What are the most important key challenges you would face for adopting Big Data?

As can be seen in Figure 12, all challenges are seen as more or less equally challenging, but skills and the difficulty of the process are perceived as slightly more challenging.
It is expected that a more defined position will emerge once additional surveys have been collected.

Role of regulation and legislation

Two regulatory aspects have a specific impact on Big Data in the public sector: data protection legislation and PSI legislation.

New Data Protection Directive. The EU Directive currently in force was created to regulate the processing of personal data within the European Union. Officially known as Directive 95/46/EC, the legislation is part of EU privacy and human rights law.

The main changes proposed in the new directive, currently under review, are the following:

One regulatory framework across Europe: The aim of the new European Data Protection Regulation is to harmonise the current data protection laws in place across the EU member states. The fact that it is a regulation instead of a directive means it will be directly applicable to all EU member states without a need for national implementing legislation.

Data breach notification: The revised framework is widely expected to require organisations to notify users and authorities about data breaches within 24 hours. Companies need to take responsibility for the data they own, and it is vital for end users to be aware of compromised information so they can take protective measures such as changing passwords.

Right to be forgotten: One of the most contentious proposals is the right to be forgotten. The proposal says people will be able to ask for data about them to be deleted. Organisations will have to comply unless there are legitimate grounds to retain the data. Internet users must also give explicit consent to the use of data about them, be notified when their data is collected, and be told for what purpose it is being processed and how long it will be stored.

Regulatory intervention: In general, organisations can also expect greater regulatory intervention, with wider powers and an expanded role for supervisory authorities. Firms that fail to comply with the proposed new rules will be fined a percentage of their global revenues, although the exact level is not yet clear, with reports ranging from 1% to 5%.

Some analyses of how this directive affects Big Data and Open Data initiatives have been published (Hunton & Williams LLP, 2013). In particular, two scenarios have been assessed:

When Big Data is analysed to detect trends and correlations in the information, the data controller's technical and organizational safeguards are paramount.
The data controller needs to be able to achieve functionally separate processing of existing personal data for big data purposes, as well as guarantee the confidentiality and security of the data.

When the processing of big data directly affects individuals, it is considered that specific opt-in consent will almost always be necessary. In particular, organizations should provide data subjects and consumers with easy access to profiles and disclose their underlying decision criteria. The specific provisions of the Data Protection Directive relating to historical and statistical research are relevant to big data processing.

Regarding open data, it is considered that the publication of personal data does not exclude the application of data protection law. Data protection law applies as soon as information relating to identified or identifiable individuals is processed, whether or not the information is publicly available.

New PSI directive. The new directive is currently in the negotiation phase. The update of the 2003 PSI Directive aims to reach a Europe-wide consensus on making PSI readily available, which would help bridge the current gap between member states' levels of openness regarding non-personal data that is produced, stored, or harvested by the public sector. The EC strongly believes that more open PSI means citizens have at their disposal reliable knowledge regarding Government, thus enabling them to participate actively in the public arena, fostering a sort of 'E-democracy'. Most notable is the novel insistence that disclosing PSI data for reuse be obligatory. The previous version of the directive had merely encouraged this practice, leaving it as a suggestion. Now, European national governments will be required to provide access to all PSI data - ranging from digital maps to weather data to traffic statistics - at zero or marginal cost. Also new is the explicit

inclusion of cultural institutions, such as museums, libraries, and archives. The expected effect of this new set of guidelines is also to generate income (EPSIplatform, 2013).

One main concern for the public sector is the obligation to disclose PSI: it must be clear who will be responsible for the related costs, as local and some regional administrations may not have the budget to set up the required infrastructure unless public platforms are set up and made available to fulfil such obligations. How this effort will be funded is not clear, and it will be a key issue in facilitating the availability of the mandatory data, especially in countries that are under the spotlight for deficit reduction.

Big Data Application Scenarios

Monitoring and supervision of on-line gambling operators

Description and Example Use Case:
This scenario is a real need, but not yet implemented. The agency is taking steps to implement the infrastructure that will allow this use case to be performed.
What is the application scenario about? To monitor the on-line gambling operators.
What is the goal of the application scenario? Control of gambling operators and detection of fraud.
What problems can be solved by means of the application scenario? The amount of data received in real time and on a daily and monthly basis cannot be processed with standard database tools.
And how are those issues handled today? The data is received and stored, but no active analysis is performed, only on demand.
Who are the users? The user is the public body in charge of the supervisory activity.
What is the underlying business model? Who is paying for the new value generated? Who is providing the solution? It is a regulatory obligation from the public administration; the on-line gambling operators must provide the information to the regulatory public body through a specific communications channel.
Why is it Big Data? Real-time data is received from gambling operators every five minutes. Gambling operators must send information to the supervisory body with the following frequency and content:
Daily and monthly information (growth rate of 50%) on user registration, game accounts, gambling operator account, jackpots and live games.
Real-time information (growth rate of 50%) on games.
The supervisor still has to define the use cases on which to apply the analysis of the data received.

User Value:
User Impact: high, due to the potential for fraud detection and criminal investigation. In the future, under court request, data will be analysed aggregating bank account operations and tax information.
Maturity of Application Scenario: Implemented manually (functionality available but without Big Data).
Financial Impact: Not defined yet.

Prerequisites: Data Security and Privacy Requirements

Data Sources: Operational data from on-line gambling operators

Application Domain: Supervisory process

Type of Analytics: Advanced Analytics (e.g. pattern recognition)

Required Big Data Technology:
Data acquisition: yes, of information collected from gambling operators
Data analysis: yes, for the purpose of supervision and fraud and criminal prosecution
Data curation: n/d
Data storage: yes, of operative information received
Data usage: n/d

Sources: Interview with General Directorate for Gambling Supervision, Ministry of Finance and Public Administration, Spain

Operative efficiency in Labour Agency

Description and Example Use Case:
This scenario is a real implementation in use.
What is the application scenario about? The German Federal Labour Agency improving customer services and cutting operating costs.
What is the goal of the application scenario? Enable a new range of personalized services.
What problems can be solved by means of the application scenario? Personalize services at a minimum cost.
And how are those issues handled today? Everybody was receiving the same standard services despite having different profiles.

Who are the users? Unemployed workers.
What is the value generated? Reduce spending by €10 billion yearly and reduce the amount of time that unemployed workers take to find employment.
What is the underlying business model? Who is paying for the new value generated? Who is providing the solution? It is an efficiency improvement of an existing public service.
Why is it Big Data? The agency analysed historical data on its customers, including histories, interventions and the time they took to find a job, to develop a segmentation based on this analysis. Based on the segmentation, the Labour Agency could tailor its interventions for unemployed workers.
The agency built capabilities for producing and analysing data that enabled a range of new programs and new approaches to existing programs. The Labour Agency is now able to analyse outcomes data for its placement programs more accurately, spotting those programs that are relatively ineffective and improving or eliminating them. The agency has greatly refined its ability to define and evaluate the characteristics of its unemployed and partially employed customers. As a result, it has developed a segmented approach that helps the agency offer more effective placement and counselling to more carefully targeted customer segments. Surveys of its customers show that they perceive and highly approve of the changes it is making.

User Value:
User Impact of Application Scenario: high. It has reached two goals: reducing the cost of the service and providing a better service to users, who are now able to find a new job in a shorter period of time.
Maturity of Application Scenario: Implemented.
Financial Impact: How much? Saving of €10 billion yearly in costs. For whom? The public Labour Agency.

Prerequisites: N/A

Data Sources: Historical data on customers

Application Domain: Public employment services

Type of Analytics: Select one of the three categories (provide rationale for classification):
Basic Analytics (e.g. monitoring, reporting, statistics)
Mature Analytics (e.g. data mining, machine learning)
Advanced Analytics (e.g. prediction, devise)

Required Big Data Technology:
Data acquisition: yes, for the periodical addition of new historical data
Data analysis: yes, for segmentation purposes of historical data
Data curation: n/d
Data storage: n/d
Data usage: yes, to compare analysed data with the information from the unemployed worker

Sources: (McKinsey Global Institute, 2011)

2.4. Conclusion and Next Steps

From the information collected so far about Big Data in the public sector, it can be said that the sector is more or less aware of the potential of these technologies, but the path to success is not currently clear due to some uncertainties, the most important of which are:
Big Data technology is immature.
There is a lack of skilled people.
The new European directives on Data Protection and PSI, to be approved in the next one to two years, leave some uncertainty about their impact on the implementation of Big Data and Open Data initiatives in the public sector. Specifically, Open Data is set to be a catalyst from the public sector to the private sector to establish a powerful data industry. It needs to gain momentum.
Today, there is more marketing around Big Data in the public sector than real experience from which to learn which applications are most profitable and how they should be deployed.
There are many bodies in public administration (especially in those which are widely decentralized), so much energy is lost and will remain so until a common strategy for reuse across technology platforms is realised.

The next step in the collection of public sector requirements will be to gather requirements through workshops for public sector officials. The first one took place on 16 April in Madrid, where the Spanish public sector was invited. Two additional workshops are foreseen in other European countries.
Detailed information about the conclusions from these workshops will be provided in the coming reports for the public sector.

Abbreviations and acronyms

DG      Directorate General
EC      European Commission
EU      European Union
GDP     Gross Domestic Product
ICT     Information and Communication Technologies
OECD    Organisation for Economic Co-operation and Development
PSI     Public Sector Information
ROI     Return on Investment

2.6. References

1105 Government Information Group. (n.d.). The chase for big data skills. Retrieved March 26, 2013, from GCN.com.

Ashford, W. (2012, January). Big changes expected as EC publishes data protection review. Retrieved April 10, 2013, from computerweekly.com.

Bossaert, D. (2012). The impact of demographic change and its challenges for the workforce in the European public sectors. European Institute of Public Administration (EIPA).

Boyer, K. (n.d.). Sentiment Analysis. Retrieved March 25, 2013, from DMGFederal.com.

Correia, Z. P. (2004). "Toward a stakeholder model for the co-production of the public-sector information system". Information Research, 10(3), paper 228. Retrieved February 27, 2013, from InformationR.net.

Cottingham, R. (2010). Greatest hits: Facebook and social networking. Retrieved April 18, 2013, from Noise To Signal.

EPSIplatform. (2013, April 11). The EU Endorses a New PSI Directive. Retrieved April 18, 2013, from epsiplatform.eu.

European Commission. (1998). COM(1998)585. Public Sector Information: A Key Resource for Europe. Green Paper on Public Sector Information in the Information Society. European Commission.

Hunton & Williams LLP. (2013, April 9). Article 29 Working Party Clarifies Purpose Limitation Principle; Opines on Big and Open Data. Retrieved April 18, 2013, from huntonprivacyblog.com.

McKinsey & Company. (2011). The public-sector productivity imperative. McKinsey & Company.

McKinsey Global Institute. (2011, June). Big data: The next frontier for innovation, competition, and productivity. McKinsey & Company.

OECD. (2006). DSTI/ICCP/IE(2005)2/FINAL. Digital Broadband Content: Public Sector Information and Content. Organisation for Economic Co-operation and Development.

Oracle. (2012). Big Data: A Big Deal for Public Sector Organizations. Oracle.

TechAmerica Foundation. (n.d.). Big Data Can Save Money and Lives Say Government IT Officials. Retrieved April 15, 2013, from TechAmerica Foundation: Big-Data-Report_FINAL-2.pdf

The European Parliament and the Council of The European Union. (2003, November 17). Directive 2003/98/EC of the European Parliament and of the Council of 17 November 2003 on the re-use of public sector information. Official Journal L 345, 31/12/2003 P. Brussels: The European Parliament and the Council of The European Union.

The White House. (2012, March 29). Big Data is a Big Deal. Retrieved January 18, 2013, from The White House.

World Economic Forum. (2012). Big Data, Big Impact: New Possibilities for International Development. Geneva: The World Economic Forum.

Yiu, C. (2012). The Big Data Opportunity. Making government faster, smarter and more personal. London: Policy Exchange.

3. Finance and Insurance

3.1. Introduction

Definition of Big Data in Finance

The finance and insurance sector has by nature been an intensively data-driven industry for many years: financial institutions have managed large quantities of customer data, and areas such as capital market trading have used data analytics for some time. The business of insurance is based on the analysis of data to understand and effectively evaluate risk. Actuaries and underwriting professionals depend on the analysis of data to perform their core roles, so we can safely state that data is a dominant force in this sector.

There is, however, an increasing prevalence of data which falls into the generic definition of Big Data, i.e. high volume, high velocity and high variety information assets¹, born out of the advent of new customer, market and regulatory data surging from multiple sources and ever increasing in volume. Adding to the complexity is the co-existence of structured and unstructured data. Unstructured data in the financial services and insurance industry is an area with a vast amount of unexploited business value. For example, there is much commercial value to be derived from the large volumes of insurance claim documentation, which is predominantly in text form and contains descriptions entered by call centre operators and notes associated with individual claims and cases. With the help of Big Data technologies we can not only extract value from such a data source more efficiently, but also analyse this unstructured data in conjunction with a wide variety of other datasets to extract faster, targeted commercial value.
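The idea of analysing claim notes together with structured data can be illustrated with a minimal sketch. It is not a method described in this report: the claim records, field names and the `amount_by_term` helper below are hypothetical, and simple keyword matching stands in for whatever text analytics an insurer would actually use.

```python
import re
from collections import Counter

# Hypothetical claim records: structured fields plus a free-text note.
claims = [
    {"claim_id": "C1", "amount": 1200.0,
     "note": "Customer reports water damage in kitchen after pipe burst."},
    {"claim_id": "C2", "amount": 300.0,
     "note": "Minor scratch on car door, no injury."},
    {"claim_id": "C3", "amount": 8700.0,
     "note": "Water damage to basement; second water claim this year."},
]


def tokenize(text):
    """Split free text into lower-case alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def amount_by_term(records, terms):
    """Sum the structured claim amount over all notes that mention each term."""
    totals = Counter()
    for rec in records:
        words = set(tokenize(rec["note"]))  # count each claim once per term
        for term in terms:
            if term in words:
                totals[term] += rec["amount"]
    return dict(totals)


summary = amount_by_term(claims, ["water", "injury"])
print(summary)
```

Keyword matching is deliberately naive here: the note "no injury" still counts towards the "injury" total, which is the kind of error that more capable text analytics would be expected to avoid.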
An important characteristic of Big Data in this industry is Value: how can a business not only collect and manage Big Data, but also identify the data which holds value and forward-engineer (as opposed to retrospectively evaluate) commercial value from it?

Benefits, Advantages and Impact

The advent of Big Data in financial services can bring numerous advantages to financial institutions. The benefits with the greatest commercial impact are highlighted as follows.

Enhanced levels of customer insight, engagement and experience

With the digitization of financial products and services, and with the increasing trend of customers interacting with brands or organizations in the digital space, there is an opportunity for financial services organizations to enhance their level of customer engagement and proactively improve the customer experience. Many argue that this is the most crucial area in which financial institutions should start leveraging big data technology to stay ahead of, or even just keep up with, the competition. To help achieve this, Big Data technologies and analytical techniques can help derive insight from newer unstructured sources such as social media. Customers are increasingly making their likes and dislikes known in the digital space, interacting with organizations online and expressing their experience with a particular organization within their digital sphere of influence. Big Data helps banks use social listening to understand customer sentiment in near real time; this creates the opportunity to map this information to datasets already available to them and so grow their engagement with customers. For example, customer retention or reducing customer

54 churn has been an issue for many banks globally. To implement an effective customer retention strategy and identify the early signs of customer churn, banks need a full view of customer interaction across multiple channels such as branch visits, internet transactions, mobile banking actions, social media actions etc. This will allow banks to detect early warning signs such as reduced interactions and allow them to take specific actions to enable customer retention and prevent revenue losses. However, one of the challenges today for financial services organizations is obtaining this complete view, relevant customer information is generally held across functional silos which makes it very difficult to detect early warning signs and carry out corrective actions in near real time. As a result, strategies are formed based on basic pieces of incomplete information rendering them less effective. Big data technologies help banks solve this data management challenge by analysing the massive volume and variety of data and allow access real time customer interactions that are more likely to provide early warning signs as opposed to providing a retrospective analysis. Additionally, the availability of sophisticated data matching capabilities can facilitate the elimination of data silos and provide the desirable complete picture of customer interaction. Big data does not only facilitate customer retention, but it can also generation of additional revenue streams from existing customers and customer acquisition. This is provided of course, if the right questions are asked of the data. For example: What products sell well together and what products are customers most likely to purchase in a package? What campaigns and offers have the highest chance of success and what types of customers should a promotion be targeting? Can a bank forecast call centre staffing levels based on the expected response from marketing campaigns? 
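As a rough illustration of the first question above (which products sell well together), the sketch below counts how often each pair of products is held by the same customer. It is a minimal example under stated assumptions, not a method described in this deliverable: the function name `count_product_pairs`, the product names and the sample holdings are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_product_pairs(baskets):
    """Count how often each unordered pair of products is held by the same customer."""
    pair_counts = Counter()
    for products in baskets:
        # set() drops duplicate products; sorted() gives every pair a stable order
        for pair in combinations(sorted(set(products)), 2):
            pair_counts[pair] += 1
    return pair_counts


# Hypothetical holdings, one list per customer (illustrative data only).
holdings = [
    ["current_account", "credit_card", "home_insurance"],
    ["current_account", "credit_card"],
    ["current_account", "mortgage", "home_insurance"],
    ["credit_card", "current_account"],
]

pair_counts = count_product_pairs(holdings)
top_pair, top_count = pair_counts.most_common(1)[0]
print(top_pair, top_count)
```

On real data the same counting would typically run on a distributed platform over far larger and more varied data sets; the point here is only the shape of the question being asked of the data.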
These questions are by no means new to the industry, but with big data enabling organizations to process huge, variable data sets, they can be answered faster and with more precision than was previously possible. This translates into highly targeted marketing campaigns, increased relevance in communication with customers and real-time customer interactions. Big Data technologies and techniques can then also be used to gauge the success of such activities by retrospectively monitoring results. This enables organizations to quantify the real commercial benefit of their investment in big data technologies.

Enhanced Fraud detection and prevention capabilities

Financial services institutions have always been vulnerable to fraud. An unfortunate fact is that every day there are individuals and criminal organizations working to defraud financial institutions, and the sophistication and complexity of these schemes is evolving with time. In the past, banks analysed just a small sample of transactions in an attempt to detect fraud. This could lead to some fraudulent activities slipping through the net, and to false positives, with non-fraudulent activities being flagged or blocked. With Big Data, these organisations are now able to use larger data sets to identify trends that indicate fraud and so help minimise their exposure to this risk. For example, if a user has travelled to London and their credit card, which they do not use often, is swiped in Manchester, there is a chance that traditional tracking algorithms will not pick this up, because the bank has a limited view of this customer: the only parameter used to track fraud in this case would be the user's credit card transactions. If the bank is able to leverage public social networking data, however, it could pick up on the fact that the user is in London and block the fraudulent transaction in Manchester.
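The London and Manchester scenario can be sketched as a simple plausibility check: compare the location of a card transaction with the most recent independent location signal for the customer, and flag the transaction if the customer could not plausibly have covered the distance in the time between the two. This is a minimal sketch, not a description of any bank's actual fraud system; the field names (`lat`, `lon`, `time`), the function names and the `max_speed_kmh` threshold are assumptions made for illustration, and the coordinates are approximate city-centre values.

```python
import math
from datetime import datetime, timedelta

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_location_mismatch(signal, transaction, max_speed_kmh=150.0):
    """Flag a transaction the customer could not plausibly have reached
    since the last known location signal (assumed travel speed limit)."""
    distance = haversine_km(signal["lat"], signal["lon"],
                            transaction["lat"], transaction["lon"])
    hours = abs((transaction["time"] - signal["time"]).total_seconds()) / 3600.0
    return distance > max_speed_kmh * hours


# Hypothetical data: a public post places the customer in London at noon.
noon = datetime(2013, 5, 1, 12, 0)
london_post = {"lat": 51.5074, "lon": -0.1278, "time": noon}
manchester_txn = {"lat": 53.4808, "lon": -2.2426, "time": noon + timedelta(minutes=30)}
london_txn = {"lat": 51.5155, "lon": -0.0922, "time": noon + timedelta(minutes=30)}

print(is_location_mismatch(london_post, manchester_txn))  # card swiped in Manchester
print(is_location_mismatch(london_post, london_txn))      # card swiped nearby in London
```

A single rule like this would only be one input among many; its value in the scenario above comes from combining card data with a second, independent data source.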
This example demonstrates how banks can get a more complete view of the consumer based not just on credit card or enterprise data, but also on location and social data. Data from multiple channels means banks can get a single view of their customer and update their fraud protection practices to be more effective.

Enhanced Market Trading Analysis

Trading in the financial markets started becoming a digitized space many years ago, driven by the growing demand for faster execution of trades, a demand which still exists today. Trading strategies that use sophisticated computer algorithms to trade the financial markets rapidly are major beneficiaries of big data. Market data can itself be considered big data: it is high in volume, it is generated from a variety of sources and it is generated at a phenomenal velocity. However, this big data does not necessarily translate into actionable information. The real benefit of big data lies in effectively extracting actionable information and integrating it with other sources. Market data from multiple markets and geographies, as well as from a variety of asset classes, can be integrated with other structured and unstructured sources to create enriched, hybrid data sets. This provides a comprehensive and integrated view of the state of the market and can be used for a variety of activities such as signal generation, trade execution, P&L reporting and risk measurement, all in real time, hence enabling more effective trading.

Stakeholders: Roles and Interest

The identification and roles of stakeholders will be elaborated in the next iteration of this chapter after further market analysis.

Drivers of Big Data in Financial Services and Insurance

Highlighted below are four broad industry drivers that accelerate the need for Big Data technologies and management techniques in the financial services and insurance industry.

Data Growth

Perhaps the most obvious driver is that financial transaction volumes are growing, leading to data growth in financial services firms.
In capital markets, the presence of electronic trading has led to a decrease in the value of individual trades and an increase in the number of trades. The advent of high-turnover trading strategies generates considerable order flow and an even larger stream of price quotes. Data growth is not limited to capital markets businesses. The Capgemini/RBS Global Payments study for 2012 estimates that the global volume of electronic payments is about 260 billion and growing between 15 and 22% for developing countries. As the devices that consumers can use to initiate core transactions multiply, so too does the number of transactions they make. Not only is the transaction volume increasing; the data stored for each transaction is also expanding.

Increasing scrutiny from Regulators

Regulators of the industry now require a more transparent and accurate view of financial and insurance businesses. This means that they no longer want reports; they need raw data. Financial institutions therefore need to ensure that they are able to analyse their raw data at the same level of granularity as the regulators will, and to identify any issues before the regulators do. In effect, financial institutions have no choice but to deal with Big Data.

Advancements in technology mean increased activity

Thanks largely to the digitization of financial products and services, the ease and affordability of executing financial transactions online has led to ever-increasing activity and expansion into new markets. Individuals can make more trades, more often, across more types of accounts, because they can do so with the click of a button in the comfort of their own homes. Instead of visiting a broker to obtain insurance quotations, users are able to do this on the move with their mobile device. Increased access and ease of use translate into increased activity, which in turn translates into rapidly growing data volumes.

Changing business models

Driven by the aforementioned factors, financial institutions find themselves in a market that is fundamentally different from the market of even a few years ago. Successful organizations must be able to apply changes quickly and build agility into their business models, or risk losing market share and the confidence of customers. Adoption of big data analytics is also necessary to help financial institutions build business models geared towards retaining market share in the face of increasing competition from other sectors. For example, e-commerce businesses have leveraged big data to start supplying consumer credit; if financial institutions do not adapt their business models, they risk losing their consumer base to organizations that traditionally do not provide financial services.

Challenges that need to be addressed

A number of challenges still need to be addressed in the finance and insurance industry to facilitate the adoption of big data technology, tools and techniques. The two main challenges found to resonate with most of the financial services and insurance organizations analysed are highlighted below.

A lack of skills

A common theme across the industry is the challenge of having access to the right level of skills.
Organisations have recognised the data and the opportunities the data presents; however, they lack human capital with the right level of skills to bridge the gap between data and potential opportunity. The skills that are missing are those of a typical data scientist. A data scientist can be identified as having three key skills:
• Commercial: the ability to translate business problems into technology solutions.
• Analytical: strong statistical and analytical skills, with a background particularly geared towards unstructured data mining. These are not to be confused with the skills of a typical statistician.
• Technical: strong scientific or technical skills, for example the ability to write scripts and extract the core value from data.

These skills do exist in isolation in the industry, but in-depth re-skilling is required to produce the human capital that can really extract value from Big Data. Some financial services institutions have sought these skills outside their own organization by partnering with supposed specialists in this area; however, it appears that resources possessing all three core skills are not in abundance.

Data Actionability

The next main challenge lies in making big data actionable. As mentioned earlier, big data technology and analytical techniques enable financial services institutions to gain deep insight into customer behaviour and patterns, but the challenge still lies in organizations being able to take specific action based on this data. As a hypothetical example, suppose an organization's big data analytics can identify in real time a specific target group of customers who are 90% likely to take up a new offering, but the organization does not have the technology and the processes to leverage that information in real time: the information is then of no use. The challenge for organisations is therefore still to integrate intelligence from big data into operations.

Role of Regulation and Legislation

Regulation appears to be a primary external influencer of Big Data and analytics trends. Increased regulatory uncertainty, regulatory pressures and global business demands are also forcing financial services firms to rethink the value of the technologies, data management and business processes they use to operate effectively, compete and manage risk. Stringent regulatory compliance laws have been put in place to improve operational transparency. Financial services organizations are held much more accountable for their actions, and are required to be able to access years of historical data in response to regulators' requests for information at any given time. These records must be available on demand, or in some cases must be normalized and sent to regulators proactively. Partly because of these pressures, financial services companies have realized that the key to optimizing their business operations is maintaining an efficient and large-scale data management infrastructure.
In other words, the adoption of big data technologies and techniques could be the most efficient way to meet regulatory constraints around data.

Big Data Application Scenarios

Application scenarios will be elaborated in the next iteration of this chapter following further market analysis.

Conclusion and Next Steps

In conclusion, there is no doubt that Big Data is highly applicable in financial services and insurance. However, more can be done, and is being done, to facilitate the adoption of Big Data technologies and techniques so that organisations can truly benefit from their big data. There are clear challenges along the way, and the next steps are to delve deeper into those challenges, understand more about stakeholders' roles and responsibilities and produce a detailed market analysis showing the potential impact of Big Data in this marketplace.

4. Telco, Media & Entertainment

4.1. Introduction

In recent decades, enormous technological changes have been shaping the Telco and Media industries. Telecom networks have evolved, new networks have replaced or complemented old networks and new services have emerged. Smartphones have reinvented the very concept of the telephone; telephony is now a commodity that comes as part of a device, alongside internet connections, software applications (apps) and integrated services. Big Data enables telecom operators to explore, interpret and then benefit from the wealth of data that is generated between customers and their networks and systems. Other players, such as IT providers, will guide the adoption of this new technology and will provide the tools to bring operators and customers closer in domains such as customer care, commercial offering or network monitoring.

Media too has been transfigured by the velocity of technological change. User-generated content (UGC), digitalisation and piracy have all increased the volume and availability of content while putting pressure on its perceived value. In short, the media must do more with less. Intelligent management and creation of media content will be essential, as the volumes of media content will inevitably increase and the sources of this content become more disparate. Big Data is an opportunity for the Media industry to understand the issues concerning the vast amounts of content it creates. These issues include scalability, discovery, management, creation, distribution, and the association and linking of multimedia content. It also offers the opportunity to test the long-term feasibility of cutting-edge approaches such as semantic technologies.

The BIG project is analysing the requirements and identifying the technology gap, with the aim of establishing a roadmap for the use of Big Data in the Telco and Media industries. This deliverable has four main components.
The first part sets out the aspects of the industry that are most salient to the adoption of Big Data technologies. Secondly, the document identifies a range of application scenarios where Big Data has the potential to transform processes and markets. Thirdly, known business requirements, and identified areas for further investigation, are captured under the headings of the five technical working groups of the BIG project. Finally, a conclusion brings together the themes of the preceding sections to help focus future discussions between the Sector forums and the Technical working groups.

Industrial Background

Telecom

The telecom industry has experienced more change in the last 15 years than in its entire previous history. This section gathers the main changes and current trends in the telecom sector: it analyses the context of the telecom market, its particular characteristics and the main trends that can be observed, in order to describe the setting that Big Data enters.

Characteristics

From monopoly to free market

The telecom industry has shifted from a nation-based monopolistic industry to a globalised free market system, which has been altering market behaviour for some time. Changes in infrastructure investment, increasing competition and the evolution of pricing models are some of the effects that can be observed. Competition goes in every direction. According to IBM (IBM), infrastructure will increasingly be provided by public entities such as governments and municipalities. For example, public Wi-Fi hotspots on public transport are increasingly popular all over Europe. Another example is local investment by housing associations in local access networks to build fibre-to-the-home broadband access (Amsterdam). The active role of governments is important for the technological development of Europe. This trend relates not only to investments in research and education, but also to investments in infrastructure and coverage to catalyse technological development in European industry.

Cloud computing

As stated in (WEF), in the past few years the boundaries between information technology (IT), which refers to hardware and software used to store, retrieve and process data, and communications technology (CT), which includes electronic systems used for communication between individuals or groups, have become increasingly indistinguishable. The rapid convergence of IT and CT is taking place at three layers of technology innovation: cloud, transmission channel (pipe) and device. As a result of this convergence, industries are adapting and new industries are emerging to deliver enriched user experiences for consumers and enterprises. Before the transformation of CT into ICT, real-time voice services played a dominant role in telecommunications.
At that time, the telecom industry focused on finding solutions that empowered customers to roam with their mobile phones over a mobile network at a price acceptable to both the carrier and the subscriber. Since the transformation of CT into ICT, media services and breaking news have become widely available via mobile networks and have replaced the previously dominant real-time voice services in CT. Today, the telecommunications industry focuses on the customer need for seamless services supported by integrated mobile networks. The very meaning of the word "pipes" has also changed. Although the term still refers to a data connection (with a pipe being analogous to bandwidth or throughput), it has evolved from a physical connection, such as a cable, to all-Internet-Protocol (all-IP) networks. Likewise, when telecommunications companies referred to networks in the past, they meant connected networks. Today, the same word also refers to data transmitted via ICT. Moreover, the rise of cloud computing will have an impact on telecom providers. Besides creating the need for more bandwidth, it may further speed up the integration of IT and telecom.

Wireless everything

Users want to have wireless access to everything. The installed base of smartphones exceeded that of PCs in 2011 and is growing more than three times faster than that of PCs (WEF). Looking forward, approximately 4 billion smartphones are expected to ship between 2011 and 2015, clearly establishing them as the most pervasive computing and Internet access device today and in the future. The introduction of smartphones changed the concept of the telephone: a smartphone is a small personal computer that comes with telephony. In addition, the dramatic increase in application stores has reinvigorated the market for mobile applications.

Moreover, the increased traceability of social networks, which are mostly accessed from mobile networks, can enhance the ability to extract actionable insight by analysing their form, distribution and structure through digital media. Consequently, there is enormous potential within the social sciences to generate important insights and innovation through an improved understanding of spatialised social networks (i.e., place-based analyses of social network structures over time).

M2M and Internet of Things

Mobile communication between objects, machines or sensors has led to the growth of M2M connections. M2M technologies are being used across a broad spectrum of industries: smart metering and utilities, maintenance, building automation, automotive, healthcare, consumer electronics, etc. M2M applications mostly generate data in a different way from humans: each device produces a small amount of data during a short period of time. The Internet of Things and M2M technology rely on service platforms but also on the internet and, of course, on the networks. For devices to communicate autonomously, it takes devices and sensors, a radio access network, a gateway, a core network and a backend server. This is why mobile telecom operators see M2M as an important source of revenue in the coming years.

Increasingly advanced large-scale M2M applications require advanced service enablement platforms that integrate remote devices, mobile networks and enterprise applications. According to Cisco (Cisco, 2013), bandwidth-intensive M2M connections are becoming more and more prevalent. Among the various verticals, the healthcare M2M segment is expected to experience the highest CAGR, at 74% from 2012 to 2017, followed by the automotive industry at 42% CAGR.

Fragmentation

There is a high level of fragmentation at all levels of the value chain in the telecom market: different devices (with different operating systems in the case of smartphones) are used to access different services (voice, data, IMS, chat, etc.) over heterogeneous networks (fixed, mobile, etc.). Device segmentation implies several challenges. Whatever the device, the user should be provided with the same service. This applies not only to fixed and mobile devices but also to different mobile handset models, which might require different versions of the same application depending on the operating system. As far as networks are concerned, there are fixed and mobile networks, and within wireless there are services based on circuit switching and on packet switching. The challenge is to implement services that are valid for all network technologies at the same time. Fragmentation goes beyond technology. The telecom market has corporate and residential segments, with very different needs and solutions. Moreover, customers subscribing to different services (e.g. fixed and mobile) with the same operator rarely get a unified bill, because of IT systems fragmentation. Figure 13 illustrates fragmentation in the telecom market.

Figure 13: Fragmentation in the telecom market

Commoditisation

The traditional services offered by the telecom sector have largely become commodities, especially standard services like telephony and data communication.
The pressure that commoditisation puts on the margins of service providers has pushed telecom operators to pursue different commercial strategies (low prices, flat rates, subsidised devices, etc.) for undifferentiated services. In addition, telecom operators and OTT players (Over the Top, e.g. Google) have for some time been fighting a commercial battle whose outcome is still not clear. Telecom operators perceive that OTT players are taking advantage of their infrastructure, while OTT players argue that their services help operators keep their customer base. Something to find out during the BIG project is whether Big Data could help telecom operators become something other than dumb pipes serving OTT players.

Slow growth

The telecom market is slowing down after several years of significant revenue growth. According to the European Union, domestic revenue growth for most European carriers was negative in 2011, although some operators were able to grow their overall revenue thanks to the diversification of their businesses into emerging markets (European Commission, 2012). The introduction of flat tariffs has produced a situation in which costs no longer match revenues: traffic increases, but this does not translate into proportional revenue. According to IBM (IBM), the ever-increasing amount of data carried over a network designed for voice and lightweight traffic has made traffic and revenue diverge.

Shared costs

Infrastructure sharing is one of the trends related to cost sharing in the telecom market. There are several examples of telecom providers that have launched joint ventures to share their network equipment (Thomas, 2012). Telecom service providers can share infrastructure in many ways, depending on telecom regulation and legislation. Some examples of infrastructure sharing are listed below:
• Passive infrastructure sharing, including base stations, antennae, cables and other sorts of non-electronic infrastructure.
• Active sharing, including electronic infrastructure.
• Spectrum sharing, adopting roles of virtual network operators or even sharing the frequency (QoSMOS) for different crossed wireless technologies or even for the same technology.

In particular, the Mobile Virtual Network Operator (MVNO) has emerged as one of the most influential models in the telecommunications landscape.
From the point at which it became possible to decouple the provision of differentiating telecommunications services from the ownership of either network infrastructure or Radio Access Network (RAN) allocation, the future viability of the MVNO model was assured. The rise of the MVNO model allowed MNOs to question their own core business. Where previously ownership of the network was seen as something to be fiercely guarded, the new model acknowledged that opening up the ways in which the network could be put to use by third parties could lead to completely new streams of revenue. To understand the new MVNO business models, it is important to understand the central role of IT: the success of any MVNO-based business depends on the efficiency and innovation with which technology is used. It is the technology that enables third parties both to access the telecommunications core and to develop new application-driven services. Mutually profitable revenue and cost sharing between multiple partners can only be assured if industrial-strength applications and collaboration infrastructure are in place. For the MVNO, the MNO-owned option also raises concerns about lock-in. The future state of MVNO business will be characterised by highly agile partnerships, capable of forming and reforming as conditions and opportunities change. With this degree of volatility, a third-party MVNE (Mobile Virtual Network Enabler) platform can be seen as promising lower risk and greater agility. This agility becomes key as MVNOs seek to broker deals with multiple providers including, in addition to MNOs, cable and fixed-line players. This provides a powerful incentive to establish the capability of different partnership combinations without the need to re-engineer the MVNE platform. There are now enough examples showing how models that decouple service from network ownership create wealth, expand entrepreneurial opportunity and provide a greater range and choice of consumer services.

Regulatory framework

Regulation in the telecom market is meant to protect customers and to foster reasonable competition. Regulation is necessary in order to avoid abusive tariffs and, particularly as far as data is concerned, to mitigate customers' privacy concerns (they expose their personal data online), but it has a clear impact on revenue for telecom operators. For example, in terms of pricing, the forecast is that both mobile and fixed termination rates will continue to decrease in line with the EU recommendations (Europa Press Releases Rapid, 2009). As far as net neutrality is concerned, according to the European Commission (EurLex, 2011), much of the neutrality debate centres around traffic management and what constitutes reasonable traffic management. It is widely accepted that network operators need to adopt some traffic management practices to ensure an efficient use of their networks, and that certain IP services, such as real-time IPTV and video conferencing, may require special traffic management to ensure a predefined high quality of service. However, the fact that some operators, for reasons unrelated to traffic management, may block or degrade legal services (in particular Voice over IP services) which compete with their own services can be considered to contradict the open character of the Internet. Transparency is also an essential part of the net neutrality debate. All this means that telecom operators will have to adapt their innovation, including innovation around Big Data, to this regulatory framework in order to find new sources of revenue.
After all, it is their customers' personal data travelling up and down these networks and devices.

Software productivity metrics

The telecom sector (like others) is increasingly concerned about controlling IT costs. It is common for software in BSS domains to be analysed using tools and methodologies that measure its functionality and quality from the technical and functional points of view. Big Data platforms and related software imply a change in databases (from relational to non-relational) and in data processing. Existing software productivity tools (such as the Function Points methodology) must ensure that telecom providers can keep doing what they have finally succeeded in doing after 20 years: measuring their software in order to control and manage their IT costs.

Telecom stakeholders

The study of Big Data for Telecom requires covering the whole telecom business landscape. Different stakeholders take part in OSS/BSS information systems, producing data that needs to be stored, retrieved and correlated in order to generate more business. The first step for the Telecom SF is to identify contacts across this landscape who could provide first-hand requirements for Big Data in the telecom sector. In order to identify and classify these contacts, the landscape must be drawn first. There are several approaches to getting a picture of all the actors involved in the telecom value chain. One of them is to separate telecom core processes from non-core ones. For each, we can identify different stakeholders covering different roles in the telecom business.

The following picture illustrates this approach and constitutes an exercise in the identification of actors.

Figure 14: Telecom landscape, first approach for identification of actors.

However, given that there is a specific standard framework for business process design and deployment in telecom (eTOM), the most suitable way to classify roles is to rely on it for our research. The enhanced Telecom Operations Map (eTOM) is a guidebook built on the TM Forum Telecom Operations Map (TOM). Currently, eTOM is the most widely used and accepted standard for business processes in the telecommunications industry. The eTOM model describes the full scope of business processes required by a service provider and defines key elements and how they interact. Among its advantages is that it establishes a common vocabulary for both business and functional processes. The framework makes it possible to map business processes into a language that all parts of an organisation can understand, thus supporting a business-driven approach to managing enterprise processes. This seems to be the right approach as far as Big Data analysis for the telecom sector is concerned, because eTOM provides a reference framework in which we can categorise all business activities at all levels of the telecom industry. The first step is to focus on Level 0, which gives the highest-level view. The following diagram places some stakeholders within the eTOM Level 0 framework.

Figure 15: eTOM-based identification of players

In this picture, no distinction is made between Strategy, Infrastructure and Product (SIP) and Operations. This is something to be analysed in more detail at Level 1 with some of the stakeholders represented in this picture.

Relevant actors in the telecom sector are listed below:
- Telecom operators
- Virtual telecom operators
- System integrators
- Network equipment vendors
- Device manufacturers
- Marketing 2.0 companies
- Regulatory bodies responsible for establishing the legal framework
- European Commission:
  o DG Informatics (DIGIT): According to its mission statement, its goal is to enable the Commission to make effective and efficient use of Information and Communication Technologies in order to achieve its organisational and political objectives.
  o DG Communications Networks, Content and Technology (CNECT): According to its mission statement, this DG helps to harness information and communications technologies in order to create jobs and generate economic growth; to provide better goods and services for all; and to build on the greater empowerment which digital technologies can bring in order to create a better world, now and for future generations.

In addition, the eTOM Business Process Model can be complemented with SID (Shared Information/Data), which provides an information/data reference model and a common information/data vocabulary from a business entity perspective.

Figure 16: eTOM SID model

This model can be used to classify the data technically during the requirements elicitation phase.

Figure 17: Detailed data classification in eTOM SID model

Nature of the data

As far as data is concerned, the telecom sector has experienced huge growth in data volume in the last few years, mainly due to the following drivers:
- Increasing popularity of mobile internet services, which generate data on the devices, networks and systems:
  o Smartphone shipments are increasing every year (by 49% in 2012 according to Gartner (Gartner, 2012)).
  o Global mobile data traffic grew 70% in 2012, according to Cisco (Cisco, 2013).
- Context information is increasingly used. Smart devices are physically integrated in space, and this also contributes to the generation of data (e.g. accelerometers register information every time the device moves).
- The rise of social networks, used to upload messages, pictures, videos and all sorts of files.
- Mobile application downloads. These apps do not produce revenue by themselves (since their price is usually very low), but they drive hardware sales, advertising spending and technology innovation. They also produce lots of data.
- Expanding cloud services, which bring more remote storage and processing and, with them, more traffic across networks.

Telecom operators have access to varied data that comes from different sources. The following is a possible classification of this data according to how complex it is to gather:
- Volunteered data: data that customers naturally provide at registration.
  o Examples: name, address, date of birth, gender, profile, preferences, number, SIP address, IP address, SIM, Soft SIM, serial number, device details, subscription date, job, marital status, voice recordings for product subscription, etc.
  An operator with a customer base of approx. 40M subscribers (including interested customers who have not yet activated a contract, prepaid, postpaid, etc.) requires a minimum of 500 GB to store this basic information for mobile alone.
- Observed data: data retrieved from service usage, based on information held in customer care, billing systems, the network, etc. Some of this data needs to be stored for a certain period of time according to national laws (e.g. bills).
  o Examples: billing information, internet access, call origin and destination, reason for session failure, service delivery completeness, service used, complaints reported, products purchased, locations (GPS or cellular), on/off roaming, etc.
  Cisco has forecast that between 2011 and 2016 global mobile internet data traffic will increase by a factor of 18 (to 10.8 exabytes per month) and global IP traffic will reach exabytes per month by 2016 (Cisco, 2013). A telecom operator might produce 1 TB of raw packet-switched signal data per day and store it in a file format. After processing, 550 GB of xDR signal data is generated per day and saved in a database format. This data is often kept for a few days or a few months (ZTE, 2012). For example, according to (DigitalRoute, 2012), in a mobile network there are typically online transactions between network gateway nodes and online charging or mediation systems for charging purposes. However, these volumes represent only a fraction of the usage data from the network, which primarily comes in the shape of records that are transferred at discrete intervals. It is common for network operators to set these intervals to 10 to 15 minutes and to batch process the data at off-peak hours. For end-user communication, this is not an acceptable solution. Customers demand accurate information about their usage through, for example, SMS, IVR, smartphone applications or USSD. Sending these notifications and alerts in a timely manner requires usage data to be collected continually at short intervals. With mobile data speeds of 8 to 20 Mbit/s, reaching a 1 GB allowance within minutes is a practical reality. According to (Cisco, 2013), the biggest gains in share will be in M2M (from 5% of all mobile connections in 2012 to 17% in 2017) and smartphones (from 16% of all mobile connections in 2012 to 27% in 2017). The highest growth will be in tablets (CAGR of 46%) and M2M (CAGR of 36%).
- Inferred data: data that requires some analysis of volunteered and observed data.
  o Examples: most used IP address, preferred contact channels, last known location, destinations visited frequently, etc.
  By analysing user profiles, product packages, services, billing and financial information, operators can derive precise control policies. Web pages, messages, pictures, movies and other traffic delivered through the network can also be analysed to better understand user behaviour. A marketing portal provides daily and monthly statistical reports on data flow, revenue, subscriber development, warnings and summary tree structure. The amount of data added each month can reach up to 4 TB. It usually takes 26 hours to analyse 4 TB of data using a traditional method, which is inefficient and cannot adequately deal with system expansion.
- Social data: personal social data on social networks such as LinkedIn, Facebook or Twitter.
  o Examples: number of followers, number of people following, influence rate per follower/followed, contact intensity, mood attributes, recommendations to and by others, school, lifestyle, opinions about products owned, opinions about customer service, opinions about telcos, etc.
  Social data creates a huge number of pieces of information. For example, Facebook users post nearly pieces of content a minute. Of course, a telco player does not need to store and analyse every single operation in social media. It would be more useful to track the interactions a telco customer has been involved in and to focus on where telco-related conversations are taking place. According to (#tcblog, 2011), more than 70% of relevant conversations in the telecom sector take place on Twitter and in forums, which clearly represents an opportunity.
- Third-party data: data coming from business partners such as suppliers (points of sale), insurance companies, airlines or banks.
  o Examples: bank information, credit cards, consumption habits, lifestyle, frequent flyer information, etc.

The more complex the data, the greater the business value that can be extracted from it. This is shown in Figure 18.
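The volumes quoted in the classification above can be turned into per-unit rates with some back-of-the-envelope arithmetic. The sketch below is illustrative only: it assumes decimal units (1 GB = 10^9 bytes, 1 Mbit = 10^6 bits) and a sustained link speed with no protocol overhead, and the function names are made up for this example rather than taken from any of the cited sources.

```python
GB = 10**9   # decimal gigabyte (assumption: the report does not state a unit convention)
TB = 10**12  # decimal terabyte (same assumption)


def bytes_per_subscriber(total_bytes, subscribers):
    """Average storage needed per subscriber record."""
    return total_bytes / subscribers


def seconds_to_consume(allowance_bytes, speed_mbit_s):
    """Time needed to use up a data allowance at a sustained link speed."""
    return allowance_bytes * 8 / (speed_mbit_s * 10**6)


def throughput_mb_s(total_bytes, hours):
    """Average processing rate in MB/s for a batch job."""
    return total_bytes / (hours * 3600) / 10**6


# Volunteered data: 500 GB for approx. 40M subscribers
per_subscriber = bytes_per_subscriber(500 * GB, 40_000_000)  # 12500 bytes, i.e. 12.5 kB

# Observed data: 1 TB of raw signalling data per day becomes 550 GB of xDR data
xdr_ratio = (550 * GB) / (1 * TB)                            # 0.55

# A 1 GB allowance at the two quoted mobile data speeds
slowest = seconds_to_consume(1 * GB, 8)                      # 1000 s, about 17 minutes
fastest = seconds_to_consume(1 * GB, 20)                     # 400 s, about 7 minutes

# Inferred data: 4 TB analysed in 26 hours with the traditional method
batch_rate = throughput_mb_s(4 * TB, 26)                     # roughly 42.7 MB/s

if __name__ == "__main__":
    print(per_subscriber, xdr_ratio, slowest, fastest, round(batch_rate, 1))
```

Under these assumptions the "within minutes" claim for a 1 GB allowance holds at both ends of the quoted speed range, which is why usage records collected only every 10 to 15 minutes are too coarse for timely customer alerts.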

Figure 18: Level of data complexity vs. business value

In addition, the data in this sector has other characteristics:
- Multiple data formats: XML for bills, MP3 for voice recordings, text for customer complaints, proprietary formats for CDR collection, SIP messages, SS7, IP.
- Highly transactional: this implies a regular, high-volume stream of records entering the system.
- Multiple devices, causing a high data generation rate.

Available Big Data Technologies

Products and concepts around Big Data technologies are proliferating, and some classification of products and features is needed in order to understand their potential benefits and applications in the telecom market. This section is intended as a high-level introduction to the available technologies for readers from the telecom domain. For further technical details and for a more accurate classification and description of Big Data technologies, please refer to the deliverables that the technical working groups of the BIG project will produce (D2.2.1a-f, foreseen in Jan 2014).

There are two main domains within Big Data technologies:
- Technologies to store data
- Technologies to process data

The main trends for each are described in the following subsections.

Data storage

In the first category, we find the NoSQL concept as a new approach for storing data elastically and reliably. Unlike traditional data storage solutions, NoSQL solutions are not based on relational databases. Instead, they do not impose a fixed relational schema and offer different possibilities for storing and relating data. The most suitable NoSQL solution depends on the requirements and the nature of the data to be stored.
- Key-value stores. These are distributed hash tables containing small key-value pairs. They are typically suitable for storing sensor data (M2M), for example. Voldemort and Membase are examples of available key-value stores.
- Column-oriented data stores store data in columns. This is useful for storing a lot of information in a single row and when all the information contained in a table cell is going to be used. Google typically uses this approach to store a complete copy of the internet. The best-known solutions are Cassandra, HBase and Hypertable.
- Document-oriented data stores are also based on key-value stores, but the associated data is a structured document (e.g. an XML file). In the telecom sector, this can be used for storing bills or network activation requests, for example. MongoDB is the most widely used document-oriented database.
- Graph-oriented databases are designed to store graph-structured data such as, for example, the social relationships of customers in a social media environment. The most relevant graph-oriented database is Neo4j.

Of course, the choice of a solution depends on the specific needs of an operator and on the strategy chosen for the exploitation of Big Data.

Data processing

The storage of data must be complemented with the processing of this data in order to efficiently retrieve the required information from it. In some cases, the purpose is searching (e.g. in a search engine, where answering requests requires the processing of collected data). In other cases, it is data analysis, in order to extract relevant and valuable information from the stored data. The MapReduce framework was created by Google and embodies this philosophy. MapReduce means running a computation in parallel over partitions of the data (Map) and merging the partial results afterwards (Reduce). In particular, Apache Hadoop, inspired by this Google concept, is an open source implementation of this approach. Hadoop is demanding in terms of the knowledge it requires, which is why a plethora of commercial distributions that hide its complexity have appeared: Cloudera, IBM, MapR and Greenplum (acquired by EMC) are some examples.
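To make the Map and Reduce steps concrete, the following minimal, single-machine sketch totals data usage per subscriber from a handful of usage records. It only imitates the programming model in plain Python: it does not use Hadoop or any of its APIs, and the record layout, the sample data and the function names are assumptions made for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical usage records (subscriber_id, bytes_used), already split into
# chunks, as a distributed file system would split a large input file.
CHUNKS = [
    [("alice", 120), ("bob", 300)],
    [("alice", 80), ("carol", 50)],
    [("bob", 200), ("alice", 10)],
]


def map_chunk(chunk):
    """Map: emit (key, value) pairs for one chunk, independently of the others."""
    return [(subscriber, used) for subscriber, used in chunk]


def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_group(key, values):
    """Reduce: merge all values of one key into a single result."""
    return key, sum(values)


def map_reduce(chunks):
    # The map step runs once per chunk; a thread pool stands in for the
    # cluster nodes that would process the chunks in parallel.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_chunk, chunks))
    groups = shuffle(mapped)
    return dict(reduce_group(key, values) for key, values in groups.items())


if __name__ == "__main__":
    print(map_reduce(CHUNKS))
```

Because each chunk is mapped without reference to any other chunk, the same logic can be spread over many machines; only the shuffle step needs to bring the values for a given key together.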
Besides data storage and data processing, there is also a set of complementary products that help to visualise and analyse data, among other tasks. Examples are data dispatching tools that route data towards historical storage or towards real-time systems, and data usage-related tools such as visualisation tools, query builders, etc.

Big Data in the telecom sector

Traditional business intelligence

Relational database management systems (RDBMSs) have been the main and almost only means of data storage for a very long time. They are based on a firm theoretical background, and implementations, including open source implementations like MySQL and PostgreSQL, have reached high levels of maturity and performance. However, RDBMSs were not designed to scale horizontally, that is, to run on clusters of servers performing large numbers of processes in parallel. Their performance limits have been reached, so alternative solutions have to be found. Ways to distribute relational databases do exist (e.g. Oracle RAC, Sybase ASE Cluster Edition), but they come at a price in terms of features, data availability, performance and administration cost.

In addition, traditional data warehouses in the telecom sector are not suitable for combining data coming from different operational areas, such as mobile, fixed and TV. Different networks have been grouped together under a common operation, but in reality they have independent architectures that are very difficult to exploit by aggregating pieces of universes coming from different sources. Moreover, some business intelligence systems are very dependent on the billing system, which is usually different for mobile prepaid, postpaid and fixed services. Billing systems hold much of the customer and usage data required by data warehouses, but dunning, payments, CRM and other systems are also crucial for gaining a complete insight into customer behaviour. Even for one single area (e.g. mobile postpaid), the information is distributed across different and heterogeneous systems that have to be synchronised (CRM, rating, billing, dunning, service platforms, etc.). It is obviously very hard for business intelligence systems to gather the real status of a service activation when several platforms are involved in the process. Moreover, the unstructured data being created in blogs, tweets, video, audio, click streams, posts on Facebook, LinkedIn or company or consumer forums, and notes within customer service or sales applications is currently not jointly available.

Another common characteristic of traditional business intelligence systems is the lack of real-time capability. They are commonly based on batch processing of pieces of information from different sources. It usually takes hours to load data into a data warehouse before it can be analysed. Performing context-based analysis on large volumes of data in minutes or seconds, based on a consumer's activity over the last several minutes, is impossible with current systems. Most data warehouses today perform analysis on old data, which is useful but does not enable the real-time analytics that companies need.

Companies that want to truly benefit from big data must integrate traditional corporate business intelligence systems with these new types of information and extract the value by fitting all these aspects into their existing business processes and operations.

State-of-the-art: are telecom companies thinking about Big Data?

According to (SAS, 2012), the adoption rate of big data in the telecom sector is 37%. In March 2012, the Economist Intelligence Unit explored how far along companies are on their data journey and how they can best exploit the massive amounts of data they are producing (and not always collecting). This report, which polled executives from around the world and across different sectors including telecoms, states that many companies are aware of the power of big data but are not yet fully exploiting the data they collect. Moreover, it concludes that Big Data can only succeed if a business puts a well-defined data strategy in place before it starts collecting and processing information (Economist, 2012). Obviously, investment in technology requires a strategy to use it according to commercial expectations; otherwise, it is better to keep current systems and procedures. In general, the required strategy might imply deep changes in business processes that must also be carried out.

Many of the companies participating in this survey (Economist, 2012) were well advanced on their data journey. Asked how long they had been working with big data, 57% of survey respondents said that they had been doing so for at least 3 years. However, almost one-third said that they did not have the right skills in the organisation to manage data effectively, and close to one-quarter said that vast quantities of data go untapped. The report found that companies that significantly outperform their peers are more likely to collect, or plan to collect, every kind of data that the survey asked about, from data generated by RFID tags to data from web-tracking technologies.

When the survey asked executives to rate the value of data to different parts of their organisation, marketing and communications emerged as one of the most important uses: 73% of respondents rated data as important or extremely important to this function.

Figure 19: Most benefitted company departments according to executives (different sectors)

Figure 20: Infrastructure readiness for main Big Data technical areas according to executives (different sectors) - Source Economist Intelligence Unit survey, March

Following the business slowdown during the recent economic downturn, there has been a rethink of the kinds of mobile services and products that will provide sustainable returns for mobile network operators. Many operators have re-positioned themselves as access providers, recognising that their previous drive to become full-service content providers had in fact hindered their progress in selling data access services. It is the introduction of the open internet, combined with flat-rate prices, that has now enabled operators to sell mobile data plans to their customers, with open platforms for smartphones playing a critical role. However, this same shift to the open internet has given peripheral players in the OS, handset and content markets access to the operators' customer bases, reducing operators to pure transport providers. This transition further commoditises their business and erodes their brand perception amongst their users, as strong brands like Apple and Google take over the user experience. Some operators are consciously giving up their central position in this space and are becoming "dumb pipes", i.e. pure transport players. This will ultimately limit their profit potential and transfer most of the opportunities for topline revenue growth to the rest of the ecosystem.

For operators who wish to position themselves as central to the mobile data ecosystem, the clearest opportunity to maintain brand position and revenue potential is to shift towards new business models that take advantage of their core capabilities and distinctive assets: network excellence, customer insight and customer experience management.

The network is one of the most important assets telecom operators have. Along with voice, video and data packets, enterprise networks carry customer relationships, revenue opportunities, operational efficiency and expense control. Advanced network analytics offers insight into evolving customer behaviours, network utilisation and process adoption rates. In-depth drilling will reveal hidden issues that can be corrected before they impact consumers and become costly problems. Big Data analytics for enterprise networks enables people throughout an organisation to make smarter decisions and to increase profitability from both sides of the equation.

According to a survey carried out in 2012 by the European Communications Magazine (European Communications Magazine, 2012) among 140 senior telco managers, mainly from Europe, most of them believed that selling data to third parties will be the aspect that generates the greatest benefit for the industry.

Figure 21: What do you think is the biggest opportunity that Big Data presents operators with? - Source European Communications Magazine survey, March 2012

Another important finding from this survey is that most telecom operator managers answered that Big Data is currently a strategic goal in their organisations:

Figure 22: Is Big Data a strategic goal in your organisation currently? - Source European Communications Magazine survey, March 2012

When asked about the usage of the data they are currently collecting, most players considered that they could do better in order to extract the maximum value from this information.

Figure 23: How well do you think you collate and analyse the data in your possession currently? - Source European Communications Magazine survey, March

In the framework of the BIG project, a survey has been carried out in order to understand the state of the art as far as Big Data adoption is concerned. The survey can be found in annex 1 and has been distributed to more than 30 telecom players world-wide (mostly European operators, but also manufacturers). Even though we expect to collect more answers in later phases of this project (only 20% of answers had been collected at the time of writing), we include here some interesting insights this survey has produced so far.

For example, there are telecom players that are not yet into Big Data, and although they are a minority, their share is rather high (28% of all those who answered). This will probably be better explained in the future as new answers arrive, since it is very likely that some of the queried players are indeed thinking about Big Data but did not answer because they consider this information to be confidential. In our survey, this was the case for 14% of respondents.

The remaining answers mostly confirm what we have gathered from the existing literature, as shown in the following pictures. Only the most relevant ones are shown here; others will be analysed further as more answers are gathered:

Figure 24: How aware is your company of Big Data business opportunities? - BIG survey

As shown in the previous picture, the majority of telecom companies are well aware of Big Data and consider it one of their strategic priorities.

Figure 25: What are the most important key challenges you would face for adopting Big Data? - BIG survey

According to the previous picture, security seems to be one of the main issues to face. It is likely that, as we get more answers, cost will become a more important factor.

Figure 26: In your opinion, which areas in your company would benefit the most from Big Data? - BIG survey

Although we expected Marketing and Offer Management to be one of the areas that could obtain the greatest benefits from Big Data, the survey indicates that it is the network.

Figure 27: Which of this data are you already collecting and which one do you plan to collect? - BIG survey

Telecom players seem to be planning to collect more data than they currently do.

Figure 28: How much does your company intend to invest in Big Data R&D in ? - BIG survey

Data Analytics as a Service

Another trend we can observe in the telecom landscape, as shown in Figure 21, is the exploitation of data not only for an operator's own business development but also as merchandise that can be sold to third parties. Some operators provide companies and public sector organisations around the world with analytical insights that enable them to become more effective. This involves the development of a range of products and services using different data sets, including machine-to-machine data and anonymised and aggregated mobile network customer data (ComputerWorld, 2012).

For example, as shown in (WEF), the analysis of a huge amount of traffic data by British Telecom helped to identify communities and the intensity of commercial relationships between several geographical areas in the UK. This also provides tools to measure the level of globalisation of an area ("how global" this area is) depending on the number of international calls this area is involved in. The telecommunications quotient can provide a new way to explore the complex web of informational linkages among industrial actors: using a simple, anonymous metric, it becomes possible to assess the degree to which firms in a given area are engaged in international communication. In other words, the big data approach to telecommunications makes it possible to examine the fine-grained variations in how companies interact with one another and with suppliers or clients around the world. In addition to highlighting important dependencies, this approach is expected to help both firms and governments to monitor a rapidly changing regional economic landscape. Moreover, the study of changes in the telecommunications interactions of households and neighbourhoods can act as an early warning system for migrations or changing patterns of work. Governments can use this to infer socio-demographic implications, as this type of data can be collated and quickly associated with people, households or firms and used to characterise them at any given point in time, not just every ten years.

This information belongs exclusively to the operators and can be provided to third parties such as service or content providers, research institutions and enterprises. Operators can offer companies and public sector bodies analytical insights based on real-time, location-based data services. Mapping the movement of crowds can also help with city and transportation planning, and can help retailers with promotions and with choosing store locations. Another example of this trend is Telefónica Dynamic Insight (Telefónica), a product by this telecom operator that can be used to represent crowd movement on a map along with sociodemographic data (for example, age), for any commercial purpose.

Location data can also be sold to third parties (cross-sectorial collaboration) such as cities or marketing companies for their own use. For example, a city town hall could buy anonymised location data to identify areas where drivers regularly reduce their speed, which may indicate areas where accidents or other problems exist.
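As an illustration of the town hall example, the sketch below aggregates anonymised speed samples per map grid cell and reports the cells where the typical speed is well below the speed limit. Everything in it is an assumption made for illustration: the sample layout, the grid cell identifiers, the minimum sample count used to suppress sparsely observed cells and the 50% threshold. The suppression rule is a simple safeguard and falls far short of a full anonymisation scheme; the code does not describe any operator's actual product.

```python
from collections import defaultdict
from statistics import median

MIN_SAMPLES = 3      # hypothetical: cells with fewer samples are not reported
SLOW_FRACTION = 0.5  # hypothetical: "slow" means below 50% of the speed limit


def slow_cells(samples, min_samples=MIN_SAMPLES, slow_fraction=SLOW_FRACTION):
    """Return the grid cells whose median speed is well below the limit.

    samples: iterable of (cell_id, speed_kmh, limit_kmh) tuples that carry
    no subscriber identifiers.
    """
    ratios_by_cell = defaultdict(list)
    for cell_id, speed_kmh, limit_kmh in samples:
        ratios_by_cell[cell_id].append(speed_kmh / limit_kmh)

    flagged = []
    for cell_id, ratios in ratios_by_cell.items():
        if len(ratios) < min_samples:
            continue  # too few observations: suppress the cell entirely
        if median(ratios) < slow_fraction:
            flagged.append(cell_id)
    return sorted(flagged)


# Hypothetical samples for three grid cells with a 50 km/h limit
SAMPLES = [
    ("A1", 20, 50), ("A1", 15, 50), ("A1", 22, 50), ("A1", 48, 50),
    ("B2", 49, 50), ("B2", 51, 50), ("B2", 47, 50),
    ("C3", 10, 50),  # a single sample, suppressed by the minimum count
]

if __name__ == "__main__":
    print(slow_cells(SAMPLES))
```

Using the median rather than the mean keeps a single fast vehicle from hiding a cell where most drivers slow down, and working on speed-to-limit ratios lets cells with different limits be compared directly.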
Location data can also be sold to retailers so that they can offer promotional coupons to customers based on their proximity to one of the company's supermarkets. This obviously poses a threat to privacy, which means that the legal framework must be studied in depth (please refer to section for further information concerning regulatory aspects). Moreover, in the scenario of the Internet of Things (IoT), these issues become even more problematic. In the IoT, devices communicate automatically and autonomously with one another using technologies such as Radio Frequency Identification (RFID). In order to protect users' identity, a number of requirements, technical and otherwise, have to be met, such as anonymisation, encryption, enablement/disablement of these features by end-users, assessment of users' consent, etc. (Dresden University of Technology).

Much of the data that will be aggregated will come from people who live in jurisdictions with privacy frameworks that are very different from European laws (which, in addition, are currently defined at national level). In addition, since online growth will continue outside our boundaries, the customer base for Big Data services will increasingly come from foreign countries. These considerations mean that business models, products and services might face legal constraints or create enormous reputational risk if they are not developed in a manner that accommodates them.

Big Data benefits for the telecom sector

According to a report produced by Cebr, an independent economics and business research consultancy (SAS, 2012), the economic benefits of big data can be modelled through the following steps:

1. Identifying and quantifying the benefits of big data to business: this requires an understanding of the benefits of big data at the enterprise level, by assembling a framework to capture those benefits in terms of economic variables, namely business efficiency, innovation and creation impacts. This involves a review of relevant academic and business literature on the potential enterprise-level impacts. The evaluation can be done in terms of five key characteristics which collectively represent the extent to which big data has the potential to transform a sector's operations: data intensity, earnings volatility, product differentiation, supply chain complexity and IT intensity. The economic impact can be quantified as follows:
   a. Enterprise-level business efficiency gains from big data: the impact on firm revenues and costs through the eTOM mechanisms (Market, Product and Customer; Service; Resource; and Suppliers). The quantification of the firm-level impacts through each mechanism can be based on a review of the academic and industry literature as well as on questionnaires and interviews with relevant actors in the marketplace.
   b. Enterprise-level business innovation gains from big data: the impact on firm innovation and, as a consequence, on new product development through more efficient use of research and development. This can be quantified by modelling the effect of an increase in data-driven R&D expenditure and the knock-on effect on future long-term sales in the telecom industry, as well as through questionnaires and interviews with relevant actors in the marketplace.
   c. Enterprise-level business creation gains from big data: the impact of reduced barriers to entry to new markets and technologies on SMB (small and medium-sized business) creation. This can be based on quantifying the effects on the number of business start-ups as a result of the profit signals generated through the first two channels.
2. Determining current and prospective rates of big data analytics adoption: this requires an understanding of the major drivers of and inhibitors to the widespread adoption of big data analytics, in order to arrive at appropriate adoption forecasts. This step requires the evaluation of the sector-level potential of big data to transform business operations (as described in step 1 above). It is also worth conducting a review of the adoption rates anticipated by technology experts in the current business environment.
3. Calculating the aggregate economy-wide benefits of big data: this involves feeding the information from the first two steps into a macroeconomic model, in order to estimate the total benefits across the European economy. The enterprise-level benefits and adoption rates can be calculated and aggregated for the whole telecom sector.

The following sections provide insights concerning specific Big Data benefits in the telecom sector. Benefits have been mapped to eTOM processes. These benefits can be amplified by combining data from different planes (for example, customer data with mediation data). The benefits are not quantified individually here, and the impact of big data on the telecom sector is hard to estimate. According to existing surveys, the majority of senior telco executives do not yet understand the potential that Big Data presents, which makes it difficult to set up a strategy (European Communications Magazine, 2012).

However, the benefits within the telecom sector include not only revenue for the sector itself but also the social impact it can bring, for example in the number of jobs it is expected to generate. According to Gartner (Gartner Press Release, 2012), 4.4 million IT jobs will be required to support Big Data by

According to the survey carried out by the European Communications Magazine (European Communications Magazine Q2, 2012), it is clear that telco players do believe that Big Data should be exploited in the telecom sector:

Figure 29: Do you believe that Big Data should be a key strategic priority for operators? - Source: European Communications Magazine survey, March

Referring to the eTOM process again, a first analysis suggests that data produced in the Operations area can help the Strategy, Infrastructure and Product (SIP) domain. For a telecom operator, this means increasing the portfolio and its value by leveraging the data produced by customers in the network and on social media, as represented in the next picture.

Figure 30: Big Data and eTOM

The combination of all these benefits within different eTOM domains can be summarised as the achievement of operational excellence for telecom operators. Nowadays¹ operational excellence can be understood as providing true customer value through highly reliable products and services based on exceptionally good performance (Meissner, 2011). It is challenging to manage reliability, recovery, change, service, collection and performance as well as customer experience and relationship. But the results can meet operators' requirements for higher efficiency and improved cost management.

Market, Product and Customer

The main benefits of Big Data-driven analytics can be summarised as follows:

- The ability to profile and segment customers based on socioeconomic characteristics can allow operators to market to different segments based on their preferences, enhancing customer satisfaction levels and reducing churn.
- Online social network analysis enables telcos to monitor consumer sentiment towards their operators, to react to trends as they develop, and to identify influential individuals within communities for direct marketing.
- Building predictive models for customer behaviour and purchase patterns facilitates the accurate appraisal of each customer's lifetime value, making it possible to focus on acquiring and retaining profitable clients. Price and product mix optimisation, predictive churn management analytics, cross-selling and location-based marketing are some examples.
- Dynamic analysis of market demand responses to price/product changes can facilitate optimal pricing and stocking decisions, reducing revenue lost through customer defections.
- Customer care and sales point efficiency can be improved by optimising service time to customers, improving average speed of answer and also employee morale through more consistent procedures.
- Better control over the activities of different points of sale.
- Minimise IT involvement by automating processes in product and advanced analytics phases.

¹ Operational excellence has had different interpretations throughout telecom history. Before ICT ( ), the focus was set on standardisation as a means to bring quality and thus reach operational excellence. After ICT ( ), the key aspect was to be able to control the great variety and fragmentation. Now, with the explosion of mobile services, devices and the internet, operational excellence should be reinterpreted as how to offer the highest value with excellent assets.

Service

Service configuration and activation processes can also be enhanced by Big Data:

- Enhance service order delivery: correct and complete service orders can be produced faster. The process of activation on several service platforms can be optimised.
- Report service provisioning: accurate monitoring of the status of service orders across different service platforms and network elements. Synchronisation of service status over different systems.
- Speed up the process of service activation by automatically filling in service parameters from available data and closing the service order when activation is completed.
- Ensure service provisioning activities are assigned, managed and tracked efficiently.
- Identification of services that are no longer required by customers.
- Optimisation of the mediation process and usage ticket production (efficiency in duplicate elimination, correction of usage data records on the fly).

Resource (network)

All the data on customer usage trends is going up and down the network. Analysing that data can turn it into usable information.

Network analytics also enables tighter control over expenses. The analysis of end-to-end traffic patterns may reveal inefficiencies and extra costs derived from underutilised lines or inefficient use (calls that are routed off the network and back again may imply unnecessary phone charges).

Big Data brings the capacity to predict and optimise network investment requirements. It makes it possible, for example, to optimally locate point-to-point routing demands from the traffic forecast, to predict network resource exhaustion in a timely manner, or even to identify potential problems by gathering information from social media (e.g. many similar tweets may reveal network issues).

Advanced network analytics, with the ability to examine micro events, may significantly shorten resolution time even for the most complex technical issues anywhere in the network.
By searching for incidents in these systems, operators can identify hidden problems and correct them before they affect users and become extremely expensive to fix, for example before incorrect bills are produced and sent to customers and complaints arise.

Big data can provide a Next Generation Network overview, i.e. unify different networks with different resources under the same operational framework, which can be very useful for network management.

Suppliers

Supply chains are complex systems, producing much data from various sources. Telecom players using analytics to forecast demand changes can anticipate their supply in order to mitigate revenue lost through stock-outs. By analysing stock utilisation and geospatial data on deliveries, operators can automate replenishment decisions to reduce lead times, thereby minimising costly delays and process interruptions. Businesses can also use this data to monitor performance and control their suppliers. Optimal inventory levels may be computed through analytics that account for product lifecycles, lead times, location attributes and forecast demand levels. The sharing of big data with upstream and downstream units in the supply chain, or vertical data agglomeration, can guide operators seeking to avoid inefficiencies arising from incomplete information, helping to achieve demand-driven supply and just-in-time (JIT) delivery processes. In the telecom sector, suppliers are many: points of sale, banks, SIM card and mobile manufacturers, IT providers, etc.
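As a rough illustration of how an inventory level might be derived from lead times and forecast demand, the following Python sketch applies the textbook reorder-point rule (expected demand over the lead time plus a safety stock). The rule itself, the function name, the service-level factor and the sample figures are assumptions made for illustration; the text above does not prescribe a specific method.

```python
import math
from statistics import mean, stdev


def reorder_point(daily_demand_forecast, lead_time_days, z=1.65):
    """Textbook reorder point: lead-time demand plus safety stock.

    z is a service-level factor (1.65 corresponds to roughly a 95%
    service level if demand is assumed to be normally distributed).
    All inputs here are hypothetical.
    """
    avg = mean(daily_demand_forecast)
    sd = stdev(daily_demand_forecast)
    # Safety stock grows with demand variability and with lead time.
    safety_stock = z * sd * math.sqrt(lead_time_days)
    return avg * lead_time_days + safety_stock


# Hypothetical forecast of daily SIM-card demand at one point of sale.
forecast = [100, 120, 90, 110, 80]
rop = reorder_point(forecast, lead_time_days=4)

# Replenishment would be triggered once stock on hand falls to this level.
stock_on_hand = 430
needs_replenishment = stock_on_hand <= rop
```

In practice, the forecast would come from the demand analytics described above, and product lifecycle and location attributes would adjust both the forecast and the service-level factor.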

Business drivers of adoption

Telco players have been analysing subscriber data for customer churn and marketing for a long time now. They have also begun to combine network data with subscriber data for new service offerings around location-based services and context-aware information. All this means that they are aware of the impact of data mining on their business.

According to SAS (SAS, 2012), the current difficult economic conditions are pushing businesses to seek cost reductions, in an environment of fierce competition and declining revenues. Thus the principal incentives for telecom operators to engage with big data are based on efficiency benefits and innovation possibilities. Where managers consider these gains achievable, uptake of the technology will follow.

Recent trends in ICT have seen exploding data volumes and consequent costs associated with accumulating and storing the data. These huge quantities of data require strategies to extract value from them. As analytics technology develops and platforms capable of processing big data become more economical, more businesses will find it viable and will adopt it. The fact that many players already use and trust high-performance analytics or employ big data to some extent can also incentivise other businesses to follow in order not to lose the race.

According to the Heavy Reading White Paper on Big Data Requirements for the telecom sector [HR01], a survey among 65 global telecom providers identified operational planning, real-time service assurance and pricing optimisation as the main areas where Big Data will help to meet business objectives.

Figure 31: Using Big Data to deliver on Defined Business Objectives (survey)

Before embarking upon Big Data, it is crucial to analyse the current and future state of the relevant systems, identify gaps and recommend solutions, which is summarised in the following steps:

- Conducting a thorough assessment of relevant internal and external data sources.

- Evaluating the existing data reliability and methods to set up the desired levels.
- Assessing the level of need for columnar data stores and NoSQL options for setting up an efficient Big Data solution.
- Identifying the analytics needed for competent and actionable insight.
- Assessing infrastructure.
- Inspecting the skills within the enterprise needed to implement and manage Big Data effectively.
- Defining the Big Data baseline and current IT state.

Finally, as explained in section [...], it is important that the adoption of Big Data comes along with the required tools (existing or new) to analyse and measure the software and its quality.

Business barriers to adoption

According to SAS (SAS, 2012), a major obstacle to adopting big data analytics is the level of technical skill required to optimally operate such systems. Although big data software solutions are becoming more and more user-friendly, specialist knowledge is still necessary. The requisite skills for big data analysis are more demanding than those required for traditional business intelligence systems, and the cost of hiring big data specialists can be too high. However, in the telecom sector this does not seem to be a problem (European Communications Magazine Q2, 2012).

Besides, the infrastructure needed for processing and returning big data queries is not equivalent to that required for ordinary data storage and querying. Optimising hardware for dynamic and complex big data can be a difficult and costly process. Moreover, operators outsource the storage of their data to third parties, for which they typically pay on a pay-per-use basis. When thinking about the storage requirements that big data brings, they might find there is more cost in it than anything else, unless the benefits of big data and the concrete data to store are well identified. The value of information lies in its value to the business for decision-making purposes. The data will have value if actions are taken as a result of the information.
In theory, having a corporate information repository is important in case it is needed, but in practice creating such a repository would be unfeasible. Organisations need to decide which information is the most important to them, and work to integrate that information first (IDC, 2012).

Another challenge is related to security. As data volumes rise, so does the cost associated with securing the data against virtual or physical threats. Compliance with data protection regulations must be ensured. Hence some businesses may simply consider the investment too costly. Since the costs of implementing big data analytics are expected to fall, many players may prefer to wait until adoption is less expensive, given the current economic situation.

There might also exist technical barriers (which the BIG project will identify). For example, the migration of huge amounts of data to cloud facilities as a first step in the big data adoption process is not a trivial task. The scale of this migration might exceed all known previous ones, and system downtime is critical for a telecom company, which requires 24/7 operation. Different migration strategies might be studied, but it is very likely that the most suitable choice technically is also the most costly.

Besides, other challenges for the telecom sector might be related to the following factors:

- Lack of reputation and culture in the software domain.
- Weak relationship with the software developer community.
- Overly long development processes (they can take several months).

- Restrictions in service innovation due to strong regulation.

Another important factor to consider is that whereas traditional data warehouses are proven Business Intelligence systems, some Big Data technologies such as Hadoop are still immature. This, together with the cloud technology that is required for storing huge amounts of data, which some businesses might consider too risky because of security concerns or vendor lock-in, is also a barrier to take into account.

According to the European Communications Magazine (European Communications Magazine Q2, 2012), the most important barrier in the telecom sector is not having a concrete strategy for data exploitation, as shown in the next picture:

Figure 32: What do you think is the biggest barrier for operators executing Big Data Strategy? - Source: European Communications Magazine survey, March

Cost effectiveness of the deployment is a major issue for companies. It is essential to gauge the actual implementation cost at an early stage and to avoid extra expenditure on excessive research and analysis. The cost of Big Data involves two main stages: the development phase required to put it in place and the operational phase for maintaining it. As far as development is concerned, some aspects that need to be considered are:

- Cost involved in programming.
- Infrastructure cost: memory, CPU and storage required.
- Specific hardware cost: Hadoop/NoSQL servers, cloud storage, etc.
- Cost of help and advice from experts.

As for the operational phase, the main expenses to consider would be:

- Software/hardware and maintenance every month/year.
- Support team expenses.
- Analysis of future loads based on history or intelligent forecasting.
- Eventual acquisition of additional resources (e.g. hardware, software, network bandwidth).

Role of regulation and legislation

The increase in the amount of data across different networks and devices, including geographical position details (by cell identification, GPS or IP access point tracking), mobile apps that share data with third parties, the popularity of social networks that permit users to share personal data, the growth of cloud computing and, in general, the rise of technologies that can be combined to aggregate data from different sources raise a number of urgent policy considerations around security and privacy.

Big data enables re-identification of data subjects using non-personal data, which threatens anonymisation. A fundamental distinction between personal and non-personal data is crucial. Besides, automated processing based on this aggregated data (e.g. credit rating, job prospects, insurance coverage profile, etc.) raises issues around access, accuracy and reliability.

One of the main concerns is that big data policies apply to personal data, i.e. to data relating to an identified or identifiable person, but it is not clear whether the core privacy principles of the regulation apply to newly discovered knowledge or information derived from personal data, especially when the data has been anonymised or generalised by being transformed into group profiles.

All this has led the European Commission to work on deep changes to the regulatory aspects of data protection. The reform includes the following main changes (European Commission, 2012):

- Guaranteeing easy access to one's own data and the freedom to transfer personal data from one service provider to another.
- Establishing the right to be forgotten to help people better manage data protection risks online.
When individuals no longer want their data to be processed and there are no legitimate grounds for retaining it, the data will be deleted.
- Ensuring that whenever the consent of the individual is required for the processing of their personal data, it is always given explicitly. This could be problematic since users seldom read or understand privacy policies.
- Ensuring a single set of rules applicable across the EU.
- Clear rules on when EU law applies to data controllers outside the EU.

A clear definition of personal data is imperative in order to distinguish personal from non-personal data. Moreover, the inference of identity from aggregated data coming from different sources has to be considered in the regulation as well (currently it is not).

The international dimension is also very important, since users share data without taking national boundaries into account. So far, regulations only apply at national level and laws differ from one country to another.

Ideally, there should be:

- Transparency: full transparency on personal data usage agreements by telcos (this would be in contrast to the existing OTT players). Customers trust their telco providers.

- Trust: there seems to be a good level of trust. Telcos already handle their customers' bills.
- Control: consumers need to have control over the shared data.
- Value: both customers and operators can monetise the personal data and benefit from sharing, profiling, custom offers, etc.

Finally, standardisation efforts around Big Data technologies would be beneficial in order to:

- Increase robustness and reduce vulnerability to hacker attacks.
- Foster the proliferation of consent and data use agreements.
- Achieve accurate and flexible datasets.

Media and Entertainment

Increased levels of broadband penetration

According to Eurostat, there has been a massive increase in household access to broadband in the years since 2006 [1]. Across the so-called EU27 (EU member states and six other countries in the European geographical area) broadband penetration was at around 30% in 2006 but stood at 72% in [...]. There are of course differences between countries in levels of broadband access, and even within countries some regions are better provided for than others. In some of the more prosperous countries such as Denmark, 90% of the population has internet access, as compared to just over 50% in Romania. Within the UK, Greater London has broadband take-up of 83% compared to just 32% in the Scottish Outer Hebrides [2].

For households with high-speed broadband, media streaming is a very attractive way of consuming content. They no longer need to be tied to TV and radio schedules, and can readily stream movies etc. from online providers. Equally, faster upload speeds mean that people can create their own videos for social media platforms.

Volumes of data and cloud computing

Digital media enables the gathering of more analytics about service usage than was ever possible before.
Web analytics programs such as Google Analytics or Adobe Omniture can report in extremely rich and customisable ways compared to the basic server logs that websites had to use up until comparatively recently. In broadcasting, the gold standard for live viewer data is still based on a highly representative sample of households (the UK's BARB system is one of the best examples of this [3]). But on-demand and streaming media usage can be measured much more extensively, in terms of both the scale of viewer engagement and the granularity of those interactions (e.g. how many viewers watched all the way to the end of a programme).

Web and broadcast media analytics are examples of Big Data sets that can be highly structured, and markets for these applications are maturing rapidly. But perhaps one of the most exciting aspects of Big Data innovation may come in the paradigm shifts in the processing and analysis of unstructured data, both internal to the enterprise and on the social web. The Library of Congress stores about half a billion tweets every day but they are not yet fully publicly available. Querying four years' worth of tweets currently takes about 24 hours for LoC staff [4]. This indicates that there is still much work for software vendors and their industrial customers to do to make such enormous unstructured datasets usable.

Cloud computing has generated almost as much attention as Big Data over the last couple of years, and the technologies are clearly complementary. Tools such as Hadoop, which is built on the Hadoop Distributed File System (HDFS) and the MapReduce model, and languages like Pig make it easier for developers to manage large datasets in the cloud. Cloud services offer flexibility and reduced costs. Instead of maintaining hardware that may not always be needed, cloud services enable a cluster of nodes to be opened on demand, used to run a job and then closed. For example, the Israeli middleware company Gigaspaces recently demonstrated a real-time analytics project (focused on counting, correlation and research) over data from Twitter that would not have been feasible without cloud technologies [5].

Consumer behaviour and expectations

Consumers are no longer passive recipients of media. There has been a huge shift away from mass, anonymised mainstream media, towards on-demand, personalised experiences. There will always be a place for large-scale shared experiences such as major sporting events or popular reality shows and soap operas. However, consumers now expect to be able to watch or listen to whatever they want, when they want it. Services such as the BBC's iPlayer have finally ushered in a long-promised age of "Martini media" ("anytime, anyplace, anywhere") [6].

The publishing and advertising industries have customarily used demographics to target products and marketing. However, since the proliferation of channels and devices in the last 15 years, some of the traditional differentiators of age, gender and so on are no longer the only ways that audiences may be segmented. Niche groups and interests can be identified much more easily by Big Data processing capability, and their needs served accordingly. For pure-play digital publishers, long tail business models will become more attractive as the costs of storing and analysing customer data come down.
As detailed in the section above, digital on-demand services have radically changed the importance of schedules for both consumers and broadcasters. Streaming services put control in the hands of users, who choose when to consume their favourite shows, web content or music. The largest media corporations have already invested heavily in the technical infrastructure to support the storage and streaming of content. Big Data will make this more accessible to smaller players, and enable them to target investment at the value-add services that would differentiate them from competitors. For example, the number of legal music download and streaming sites, and internet radio services, has increased rapidly in the last few years; consumers have an almost bewildering choice of options depending on which music genres, subscription options, devices and DRM they prefer. Over 391 million tracks were sold in Europe in 2012, and 75 million tracks played on online radio stations [7].

The rights landscape: anonymity, privacy, data protection

The explosion in the digital media landscape has not come without issues around consumer rights. Open data in particular heralds much promise in transparency, positive social change and new business models. However, there is a risk that governments and corporations could use the ever-increasing amounts of data being captured for more nefarious purposes. There is always a tension between collecting the appropriate amount of data to conduct business, and collecting more just because it's possible and might come in handy later on. The online loan provider wonga.com can scour more than 6,000 publicly available data points on an individual before deciding whether to lend money or not [8], decisions that could have a massive impact on someone's life.
Supermarkets, pharmaceutical companies and media organisations can aggregate data from millions of people in order to improve many aspects of their operations but in doing that they are also storing and accessing detailed information about an individual.

Social network users contribute masses of information to platforms in return for free services. As many have observed, "if you're not paying for it, you're the product" [9]. Facebook in particular, out of all the big social networks, has often been criticised, blocked or sued over its approach to user privacy [10]. In a recent case, Facebook was taken to court in Germany by an internet privacy organisation [11]. Differences in privacy laws between Ireland (where Facebook in Europe is based) and Germany meant that it was unclear whether German users could be forced to enter their real names and personal information.

Data protection in the EU comes under an EU-wide directive. It is due to be replaced in 2014 with more simplified administration but extra obligations for data controllers [12]. Media companies hold significant amounts of personal data, whether on customers, suppliers, content or their own employees. As they leverage the potential of cloud computing, they will face new challenges to keep data safe. Companies will have responsibility not just for themselves as data controllers, but also for their cloud service providers (data processors). Many large and small media organisations have already suffered catastrophic data breaches; two of the most high-profile casualties were Sony [13] and LinkedIn [14]. They incurred not only the costs of fixing their data breaches, but also fines from data protection bodies such as the ICO in the UK [15].

Big Data Application Scenarios

This section provides different use cases that show the potential of Big Data technologies for the telecom, media and entertainment sectors.

Telecom

A call to a telco Customer Care centre

Scenario Name: A call to a telco Customer Care Centre

Background / Rationale: This scenario represents a future daily situation in which big data technology can bring telecom operators and customers together. The most interesting side of this story is how far this is from reality today in the Customer Care domain.

Scenario description (Storyboard): A customer is using his smartphone when suddenly a message appears on the screen indicating "Emergency Calls Only" and the call is dropped. He checks the other applications that were running in parallel. He had a VoIP application with several ongoing conversations, which now seems to be disconnected. He was also uploading a video to his favourite social network, and the upload has not completed according to the status shown on the screen. Finally, he was in the middle of downloading some MP3 files from an online store, which has already received the payment for the merchandise, but not all the files have been successfully retrieved. He resets his smartphone and finds that ordinary telephony works but not his internet connection. He decides to call the Customer Care (CC) number and is immediately served by a friendly agent who knows his name. When he begins reporting his problem, the customer care agent (CSA) listens carefully and registers the description of the incident from the customer's perspective in written text. The CSA says that optionally he can attach the voice conversation to the customer's file, which he does. He asks the customer whether he had open applications when the incident occurred, out of a list of the apps the customer has installed in

his smartphone, which is known at the CC. The CSA checks the status of the network nodes in the area of the incident and immediately sees that there has been a problem in one of them. He sends an instant notification to the technical support team, who are already aware of the incident and who report that the problem will be fixed in 5 minutes. The CSA checks additional information concerning this particular customer. Not only has he been a customer for more than 10 years, but he has recently subscribed to one of their ADSL offerings. The operator has punctually received payments for the unified electronic bills available in the system. He has also raised several complaints in the past concerning the quality of his VoIP service in multi-party business calls when a number of users are involved. The CSA also gathers his most used numbers for calls and for SMS. The usage information concerning the uplink and downlink transactions with unachieved service delivery is disregarded for billing by the CSA. This action of discarding the records is registered in the system for future use, as is the whole procedure. The CSA informs the customer that he can relaunch both transactions with no duplicated cost. The payment for the music records will be cancelled with the content provider by the CSA. Besides, the CSA considers that this customer deserves a loyalty action and proposes to activate two separate discounts on calls and SMS to those favourite numbers, which will be applied immediately. Right afterwards, the customer writes an entry in his blog explaining how well he has been served by his telecom provider.

Functional areas covered by the scenario:
Network: Reduce Problem Resolution Time; Increase Staff Efficiency; CDR analysis
Sales / Marketing / Profitability: Manage Vendor Performance; Advanced Customer Segmentation; Product/Service immediate Offering; Foresee / Reduce Customer Churn; Gain Insights on Customer Behaviour; Convergent & intelligent Customer Care

Technical domains involved: Data acquisition, data analysis, data curation, data storage, data usage

Advanced customer segmentation

Scenario Name: Advanced customer segmentation

Background / Rationale: This scenario is partially based on a use case carried out by British Telecom and reported in the Global Information Technology Report 2012 by the World Economic Forum [WEF01]. This use case shows how the combination of data coming from different sources (web, geographical data, network information) can be used as an input for advanced customer segmentation and the definition of sales regions, which can, in turn, help sales departments shape their offering strategies.

Scenario description (Storyboard): A hundred million links among more than 20 million anonymised numbers, taken from an original database of some 8 billion telephones, can be examined and the resulting network analysed to identify natural communities in the data (where a community is characterised by relatively dense within-group links and proportionally fewer out-group connections). This information can be combined with information retrieved from social media. What are these customers saying in their favourite social networks to one another? What are they being told? What is being said about a new product? How often is it mentioned and how far does it get? What kind of blogs does my product appear on? Are they specialised, generic, complaints forums, etc.? Another authorised survey of thousand households and the association of more than a million call records with their responses made it possible to assess the classification of households according to their calling networks. The results suggest that some dimensions of social interaction can serve as reasonable predictors of whether a household is comprised of "Alone, over 56", "Couple, both aged over 55 with no cohabiting children", or "Couple, with children aged under 12". A community is almost synonymous with "segment". A segment is a group of customers that will react similarly to a message. This helps to identify new marketing campaigns and to define business opportunities and the right moment to communicate with customers. Communities can be found in many contexts, but broad criteria from a telecom perspective might include shared calling groups, locality, age, interests, language and subcultures. Telecom operators would seem to have particular advantages in regard to community-based marketing. They have unprecedented, aggregate detail about customers' communication habits and movements as well as a feedback channel that could be used for tailored messages.
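The notion of a community used in the storyboard above (dense links inside the group, proportionally fewer links leaving it) can be made concrete with a small Python sketch. It only scores a proposed group against that definition; it is not the detection method used in the reported study, and the function name and sample call links are hypothetical.

```python
def community_score(edges, group):
    """Return (internal_links, outgoing_links, internal_share) for a group.

    edges: iterable of (caller, callee) pairs, treated as undirected links.
    group: set of anonymised numbers proposed as one community.
    """
    internal = outgoing = 0
    for a, b in edges:
        a_in, b_in = a in group, b in group
        if a_in and b_in:
            internal += 1      # both ends inside the group
        elif a_in or b_in:
            outgoing += 1      # link crosses the group boundary
    total = internal + outgoing
    share = internal / total if total else 0.0
    return internal, outgoing, share


# Hypothetical anonymised call links: two tight groups joined by one link.
links = [("A", "B"), ("A", "C"), ("B", "C"),
         ("D", "E"), ("D", "F"), ("E", "F"),
         ("C", "D")]

tight = community_score(links, {"A", "B", "C"})   # mostly internal links
loose = community_score(links, {"B", "C", "D"})   # mostly boundary links
```

A group with a high internal share is a better candidate segment than one whose links mostly cross its boundary; a real analysis over millions of numbers would rely on a dedicated graph library rather than this exhaustive scoring.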
Functional areas covered by the scenario: Web: Sentiment analysis; Network: Traffic analysis; Sales / Marketing / Profitability: Advanced Customer Segmentation, Product/Service Offering customisation, Foresee / Reduce Customer Churn, Gain Insights on Customer Behaviour

Technical domains involved: Data acquisition, data analysis, data curation, data storage, data usage

Telecom customer journey

Scenario Name: Telecom customer journey

Background / Rationale: This scenario follows the customer journey in order to understand what data is exposed and how the customer experiences the relationship of service and trust with the telecom operator.

Scenario description (Storyboard): A telecom customer sees a product specially promoted to him on a website. He buys and pays online via his laptop. He is notified of the delivery time on his personal cell phone and decides to collect the item directly from the point of sale. He registers via tablet to receive further notifications and makes several calls to customer service because an accessory is missing from the box. The missing piece is delivered separately. Later on, the customer tweets about it and uploads photos of his new device to Facebook, which are liked by his circle of friends. According to the comments on their walls, they are all planning a trip to Indonesia. They are also fans of cultural events and share this sort of information in their community. Before his departure, the operator can offer a special international roaming plan for calls in Indonesia. In addition, the group of friends might be offered a holiday package with local lodging companies. The operator can also offer a notification service about local festivals and cultural events while they are abroad.

Functional areas covered by the scenario: Web: Sentiment analysis; Network: Traffic analysis; Sales / Marketing / Profitability: Advanced Customer Segmentation, Product/Service Offering customisation, Foresee / Reduce Customer Churn, Gain Insights on Customer Behaviour

Technical domains involved: Data acquisition, data analysis, data curation, data storage, data usage

Dynamic bandwidth increase

Scenario Name: Dynamic bandwidth increase

Background / Rationale: This example illustrates how the information retrieved by call centres can help identify infrastructure and network problems.

Scenario description (Storyboard): In most organisations, customer care data is typically analysed from an SLA (Service Level Agreement) perspective. For example, turnaround time, average wait time, etc. are often measured and ensured. However, greater insight can be gained by analysing, e.g., the actual transcript of the conversation. This could even lead to the identification of problems regarding the telecom infrastructure (e.g. infrastructure bottlenecks). Telecom providers are currently getting more revenue from data services than from voice services. This is why operators are very interested in launching new services that generate a lot of traffic, such as cloud-based gaming. In this competitive environment a telecom provider launches a new viral gaming application on mobile devices. A few days after the launch it starts observing a burst of calls to the call centres and, on text mining the transcript data, specialists find a great increase in keywords alluding to performance. The specific intelligence regarding the keyword burst and the time of day at which it was encountered can be shared with the infrastructure planning group, which then puts a plan in place to dynamically change the provided bandwidth based on usage.

Functional areas covered by the scenario: Web: Sentiment analysis; Network: Network service enhancement; Sales / Marketing / Profitability: Customer satisfaction

Technical domains involved: Data acquisition, data analysis, data curation, data storage, data usage

Security application based on cell towers

Scenario Name: Security application based on cell towers

Background / Rationale: This example illustrates how the information retrieved by call centres can help identify infrastructure and network problems.

Scenario description (Storyboard): When a call is made, the operator usually captures data such as the subscriber, the time and the duration. Depending on the type of call and service used, additional data can be gathered, for example serving switch data, serving cell tower IDs, device identification (serial) numbers, as well as International Mobile Subscriber Identity (IMSI) and International Mobile Equipment Identity (IMEI) codes. The unique ID of the cell tower a handset was connected to when a connection was made can be used for co-location analysis. By examining terabytes of CDR/tower records from the switch it is possible to triangulate on a few co-location events. A co-location event can be defined as the same mobile tower being used to route calls during a specific point in time.
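The co-location definition above can be sketched in a few lines. This is only an illustration under stated assumptions: the record layout `(imsi, tower_id, timestamp)`, the five-minute window and the function name are hypothetical, and a production system would run the same grouping as a distributed query rather than in memory.

```python
from collections import defaultdict
from itertools import combinations


def colocation_events(cdr_records, window_seconds=300):
    """Count, per pair of subscribers, how often they used the same tower
    within the same fixed time window.

    cdr_records: iterable of (imsi, tower_id, timestamp_in_seconds) tuples
    (hypothetical layout).
    """
    # Bucket subscribers by tower and time window.
    buckets = defaultdict(set)
    for imsi, tower_id, timestamp in cdr_records:
        buckets[(tower_id, timestamp // window_seconds)].add(imsi)

    # Every pair of subscribers sharing a bucket is one co-location event.
    events = defaultdict(int)
    for subscribers in buckets.values():
        for pair in combinations(sorted(subscribers), 2):
            events[pair] += 1
    return dict(events)


records = [
    ("imsi-1", "tower-A", 1000),
    ("imsi-2", "tower-A", 1100),
    ("imsi-3", "tower-B", 1050),
    ("imsi-1", "tower-C", 5000),
    ("imsi-2", "tower-C", 5050),
]
events = colocation_events(records)
```

Here `imsi-1` and `imsi-2` share a tower and time window twice, so they are the only pair reported. Fixed windows are a simplification: two calls that fall just either side of a window boundary are missed, which a sliding window would catch at higher cost.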
The combination of massive Hadoop clusters and columnar database architecture allows these queries to be executed at great speed in order to retrieve a reduced set of records for further analysis. This makes it possible to identify whether the same person has used several devices by combining CDR information (IMEIs or IMSIs) with network information (registered by the mobile tower).

Functional areas covered by the scenario: Network: CDR/tower information; Sales / Marketing / Profitability: Security application

Technical domains involved: Data acquisition, data analysis, data curation, data storage, data usage

Media and Entertainment

Data journalism

Scenario Name: A large dataset becomes available to a media organisation

Background / Rationale and Scenario description (Storyboard): This scenario deals with a situation where a database with many thousands or even billions of rows requires analysis to derive stories and insight. So-called data journalism has been an emerging trend over the last few years. As corporations and governments release more data into the public domain, there are opportunities for media organisations to exploit the data. However, these large datasets will be too unwieldy to be stored in-house by firms, and certainly too big to be analysed by the limited human resources of the modern newsroom. The data will also most likely have lots of errors, gaps, inconsistencies and poor categorisations. Therefore the media organisation puts the database online and crowdsources the analysis to its readers. Developers and activists can trawl the data and advise the organisation of interesting nuggets of information they find. Journalists can work with users to develop interesting lines of narrative. The media firm needs the ability to leverage the data in an end-to-end process, from the initial acquisition of the data, to cleaning it up, to making it possible to analyse it with a combination of statistical algorithms and human brain power. Once the analysis is done, the journalists will plan how to use the results in their content and tell stories about what has been found. Scalable data visualisation tools are needed in order to produce usable, compelling graphics for general readers. The resulting output from the work has a positive impact in shaping the news agenda, and increases sales and brand engagement for the company's products across all channels.

Technical domains involved: Data acquisition, data analysis, data curation, data storage, data usage

Dynamic semantic publishing

Scenario Name: Scalable processing of content for efficient targeting

Background / Rationale: Online content producers are creating greater volumes of multimedia content than ever before. At the same time, there is increased pressure to cut costs and headcount, so organisations have to make more efficient use and reuse of what they produce. This scenario shows how a publisher uses semantic technologies to target content more efficiently.

Scenario description (Storyboard): A news organisation that produces millions of words, thousands of images and thousands of hours of video every year needs to develop new revenue streams. They identify new markets such as niche websites and hence need a way to deliver that content. They also wish to be more responsive to customers' new requirements and to minimise the time and cost of getting new services to market. The content already has very limited tagging, but this does not provide enough detail or context to be truly useful. They decide to invest in Big Data infrastructure that supports the addition of much richer semantic metadata. This involves using an XML database and a triple store rather than traditional relational database technologies. Automatic text analysis is deployed to extract the semantic concepts. It is now much easier to aggregate content around topics or entities, and to automatically generate interesting related content items, a task that would previously have been done by a journalist. New niche customers can be served quickly as the correct data infrastructure is already in place.
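To make the role of the triple store in the storyboard above more concrete, the following is a toy in-memory sketch of storing semantic metadata as subject-predicate-object triples and using it to find related content. It does not represent any particular triple store product or query language; the class, the predicate name `mentions` and the identifiers are all hypothetical.

```python
class TripleStore:
    """A toy in-memory triple store; None in a pattern acts as a wildcard."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return all triples matching the pattern, in sorted order."""
        pattern = (subject, predicate, obj)
        return sorted(
            triple for triple in self.triples
            if all(p is None or p == v for p, v in zip(pattern, triple))
        )


def related_items(store, item):
    """Find other content items that mention at least one topic of `item`."""
    topics = {o for _, _, o in store.match(subject=item, predicate="mentions")}
    related = set()
    for topic in topics:
        related |= {s for s, _, _ in store.match(predicate="mentions", obj=topic)}
    related.discard(item)
    return sorted(related)


store = TripleStore()
# Triples such as these would come from automatic text analysis.
store.add("article:1", "mentions", "topic:olympics")
store.add("video:7", "mentions", "topic:olympics")
store.add("article:2", "mentions", "topic:elections")
store.add("article:1", "hasFormat", "text")

olympics_content = [s for s, _, _ in store.match(predicate="mentions", obj="topic:olympics")]
```

Aggregating around a topic is a single pattern match (`olympics_content`), and `related_items(store, "article:1")` returns the video that shares a topic with the article. No schema change is needed to add a new predicate, which is the flexibility the storyboard contrasts with relational databases.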
Technical domains involved: Data acquisition, data analysis, data curation, data storage

Social media analysis

Scenario Name: Batch processing of large user-generated content datasets

Background / Rationale: With millions of pieces of user-generated content being added to social networks every day, there is an unprecedented opportunity for the media sector to mine it, whether in batch or real time.

Scenario description (Storyboard): A new start-up decides its business model will be based around performing deep-dive analysis of tweets to help marketers connect with the most influential people and conversations in their product domains. Commercial organisations will commission the start-up to identify good prospects on social networks so that they can offer innovative new products to them. The start-up does not want to maintain its own hardware, so it will use cloud services to access storage and processing capability as and when it needs them. It wants to work with a diverse range of clients, so its architecture needs to be flexible enough to cope with competing requirements. It adopts a distributed querying architecture and builds flexible categorisation models on top of the databases. It configures cloud-based Software-as-a-Service (SaaS) text analytics applications according to its clients' requirements. Because the company is able to analyse huge volumes of data meaningfully in a short period of time, its clients can confidently try out new promotions on strong customer prospects.

Technical domains involved: Data analysis, data curation

Cross-sell of related products

Scenario Name: Developing recommendation engines using multiple data sources

Background / Rationale: Recommendation engines have been a feature of digital services for many years, the most famous example being Amazon's. They use customers' browsing and purchasing history to automatically suggest other products users may be interested in. There are three main types of product recommendation engine: those that work by collaborative filtering, those that work by content-based filtering, and a hybrid category that relies on both methods of filtering. This scenario deals with the potential of Big Data to let more than just the companies with the most hardware and engineers develop sophisticated recommendation products. The cost of storing and processing data streams is reduced markedly by cloud computing and distributed querying.

Scenario description (Storyboard): A publisher in the entertainment space wishes to monetise its content by selling products and advertising around it. They want to provide a great user experience to differentiate themselves from competitors who are also delivering reviews, offers and recommendations to their customers. They set up systems to collect data about customers' interactions with the content, e.g. clickstream, user journeys, social sharing and purchase history.
They develop proprietary algorithms to process the data, which comes from multiple databases, and also factor in commercial priorities such as promoting products with the highest profit margins. They set up the infrastructure in the cloud so that it can scale up or down as needed. For example, there will be much more data gathered at certain times of year, such as during school holidays or before Halloween. They use predictive and reactive analytics to estimate when peak traffic may happen, and plan data processing accordingly. The output from the data processing is combined with the metadata of the content items themselves to seamlessly generate recommendations when users are browsing the company's site on the web or on mobile devices. As a result of successful implementation, traffic to the site increases, users discover things they might not have seen otherwise and commercial revenue increases.

Technical domains involved: Data acquisition, data analysis, data curation, data usage

Product development

Scenario Name: Using predictive analytics to commission new services

Background / Rationale: Broadcasters and online publishers now have the ability to gather huge amounts of quantitative information about how their customers interact with their services. This scenario highlights one tangible application for all that data, namely mining it to support the development of new products.

Scenario description (Storyboard): A streaming media service gathers rich usage data from thousands of its customers. It knows what they buy, who buys it, how they consume it, when they consume it, where in the country they watch it and so on. It has the ability to tap into terabytes of data. To complement their traditional qualitative research methods, such as focus groups and market research, they analyse the data to try to predict what kind of original content might be successful. They are looking for a high degree of confidence in their decision-making because it is notoriously difficult to guarantee that something will be a hit with consumers. They mine the data for trends, and use statistics to highlight correlations where users who enjoyed one kind of programme also consumed others, whether of a similar genre or not. The dataset consists of millions of individual interactions, so it is too large for purely manual viewing. It is also constantly growing, so robust applications are needed to ensure that key insights are not missed. On the basis of the analysis, they decide to commission a new programme, and to put a considerable amount of marketing expense behind it. The outcome is that the programme becomes one of the most popular the service has ever streamed. The analytics have helped predict a hit show more accurately than a human commissioner might have done.

Technical domains involved: Data analysis, data usage
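The kind of correlation described in the product development scenario above (users who enjoyed one programme also consumed others) can be sketched with simple co-occurrence counts. The same counts are one basic form of the collaborative filtering mentioned in the cross-sell scenario. This is an illustrative sketch only, not the proprietary algorithms the scenarios refer to; the viewing histories, identifiers and function names are invented for the example.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_cooccurrence(histories):
    """For each programme, count how many users also consumed each other programme."""
    co = defaultdict(Counter)
    for items in histories.values():
        for a, b in permutations(set(items), 2):
            co[a][b] += 1
    return co


def also_watched(programme, co, top_n=2):
    """Programmes most often consumed together with `programme`."""
    ranked = sorted(co[programme].items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


def recommend(user, histories, co, top_n=2):
    """Suggest programmes the user has not seen, scored by co-occurrence."""
    seen = set(histories[user])
    scores = Counter()
    for item in seen:
        for other, count in co[item].items():
            if other not in seen:
                scores[other] += count
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical viewing histories keyed by anonymised user ID.
histories = {
    "u1": ["drama_a", "drama_b", "docu_c"],
    "u2": ["drama_a", "drama_b"],
    "u3": ["drama_b", "docu_c", "quiz_d"],
    "u4": ["drama_a"],
}
co = build_cooccurrence(histories)
```

In this toy data `also_watched("drama_a", co)` ranks `drama_b` first because two users consumed both, and `recommend("u4", histories, co)` suggests `drama_b` followed by `docu_c`. Raw co-occurrence favours popular items; a real analysis would normalise the counts and, as the scenario notes, has to scale to millions of interactions.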

4.4. Requirements

Telecom

Requirements in the telecom sector have been gathered for each of the eTOM domains, together with two additional areas: a General area for requirements that apply to Big Data in general as a tool, and a specific area for Social Media. For each domain, and in pursuit of a business goal (derived from current trends), a high-level business requirement is extracted and a more technical requirement for Big Data technology is obtained. In addition, for each requirement we specify the technical domain of Big Data that should provide an answer to it. In the following tables, an X in a technical area means that the requirement is strictly related to that technical area. Where the requirement can affect a technical area but is not tightly bound to it, nothing is marked.

General

Operational Area Business goal Business Requirement Technical Requirement Technical area Data Acquisition Data Analysis Data Curation Data Storage Data Usage General Customer satisfaction Customer experience management enhancement (personalised per customer) Gather and use huge amounts of data coming from different sources and in different formats in order to have a wider insight of the customer and his habits, needs, likes and dislikes X Cost reduction Cost reduction Increase customer insight Effective capex and opex management Leverage existing hardware Detect patterns - Lots of detailed historical data analysed in seconds Reduce storage space (thanks to compression or enhancement of data compression techniques) Integration of traditional corporate business intelligence systems with new Big Data technology Quick access to the customer's historical file: bills, payment behaviour, call detail trends, etc. X X X

98 Operational Area Business goal Business Requirement Technical Requirement Technical area Data Acquisition Data Analysis Data Curation Data Storage Data Usage Increase customer insight Usability IT tools Big data tools must be easy to use and quickly provide the aggregated information in a comprehensive manner. X X TTM reduction Quick IT evolutions Easy and quick IT big data systems evolution with no decrease in performance (real time). The IT involvement should be minimised. X Cost control IT cost control Tools for measuring the software and its quality. Big data software cannot be more expensive than traditional one for the same functionality X X X X X Security Secure data access Security tools to avoid unsafe data access. Data must be visible to trustful users and only can be modified by authenticated users X X X Operational excellence Security Operational excellence Increase customer insight Time awareness Reliability of information Anticipation Fast access to information Take into account the specifics of time, as the majority of Big Data is dynamic and varies over time. Data must be retrieved from trusted sources Predictive analysis based on correlated data Provide real-time or near real-time responses to users from the systems they interact with X X X X X X Operational excellence Enhancement of Business Intelligence Big Data's unstructured data must feed Business Intelligence systems Table 2. General Big Data Requirements (Telecom sector) X BIG consortium Page 98 of 162

99 Marketing, Product and Customer Operational Area Business goal Business Requirement Technical Requirement Technical area Data Acquisition Data Analysis Data Curation Data Storage Data Usage Operation, Support and Readiness Increase customer insight Enhanced customer information Different data formats can be attached and are available in a customer dossier (voice, free text, logs, video and audio, etc.) Complex and mixed format inputs Operational excellence Quick information availability Reduce data loading time X Increase customer insight Customer satisfaction Advanced customer segmentation Customised offer building Demographical data combined with usage data calls in order to identify "user communities" and reveal further information concerning customers Data based on customer loyalty and other behavioural data. X X X X X X Operational excellence Contextual information Suitable and efficient content and context-awareness. Take the context of the data into account and present it accordingly X X Operational excellence Contextual information Take into account the specifics of time, as the majority of Big Data is dynamic and varies over time X X Fullfilment Cost reduction Reduced efforts and administrative workload on vendor schedule data processing (order capture) Availability of the correlation of data from different sources X X Revenue increase Operational excellence Operational excellence Assess revenue leakage in the order-to-cash process Order-to-cash process convergence Channel source identification Ensure in advance that the fulfilment can be achieved. Delay minimisation (mostly for TV and fix telephony offerings). Possibility to combine data from heterogeneous networks to crossproduct fulfilment (e.g. activation of mobile services to complement fixed services) Allow all operations to be identified by means of the source channel through which they were requested. X X X X BIG consortium Page 99 of 162

100 Operational Area Business goal Business Requirement Technical Requirement Technical area Data Acquisition Data Analysis Data Curation Data Storage Data Usage Customer experience Optimised service time to customers by improving average speed of answer The most important information across multiple domains must be quickly available at Customer Care X X Assurance Revenue assurance Interface to network inventory solutions, service activation solutions and networks discovery information Short complex query resolution time X Increase customer insight Customer experience Customer satisfaction Customer segmentation analysis Conduct advanced analytics Customised product offers optimisation Combine customer basic data with usage data Network behaviour and subscriber correlation X Immediate customised offerings X X X X X X Customer satisfaction Conduct predictive churn management analytics Analyse customer and social data in order to prevent churn X Billing Revenue increase Customer satisfaction Revenue assurance Customer satisfaction Operational excellence Cross-selling Location-based marketing Business impact analysis Improvement of real-time services for consumption and billing Price comparison, bill simulation Convergent offering for different services and networks (fix and mobile) Real time cross-sectorial offerings based on customer location Incorporate the necessary information to keep track of the business impact of offerings Customers can retrieve in real time the information concerning their consumption Ability to simulate a bill with a price model not subscribed by the customer X X X X X X X Table 3. Big Data Requirements for the Marketing, Product and Customer domain (Telecom sector) X BIG consortium Page 100 of 162

101 Service Operational Area Business goal Business Requirement Technical Requirement Technical area Data Acquisition Data Analysis Data Curation Data Storage Data Usage Operation, Support and Readiness Fullfilment Operational excellence Customer satisfaction Service deployment operational planning Optimised service delivery time to customers Assurance Cost reduction Identify cost savings and improve services. Optimise service deployment time according to historical data related to every involved node and service platform Minimise service provisioning time and ensure the process across different platforms Analyse huge amounts of data from call detail records and inter-carrier invoices daily to help communication service providers (CSPs) X X X X X X Assurance Increase customer insight Real-time analytics should be able to retrieve information about the subscriber that is available from surrounding systems, network, social networks, etc. CDR and social network information combined in real time X Operational excellence End-to-end real time service measurement Analyse how much time is required for CDR data collection (different services, different times) X Revenue assurance Real-time SLA management and service assurance Respond to network issues based on SLAs X Billing Increase customer insight Fast access to billing historical data and ongoing consumption Multiple data formats, historical bills and ongoing consumption X X Table 4. Big Data Requirements for the Service domain (Telecom sector) BIG consortium Page 101 of 162

102 Resource Operational Area Business goal Business Requirement Technical Requirement Technical area Data Acquisition Data Analysis Data Curation Data Storage Data Usage Operation, Support and Readiness Operational excellence Network operational information Availability of information such as: Call attempts per cell, Cell failures per cell, Handover request per BSC, Calls connected, Calls cleared by user termination, PDP creation time, SGSN attach requests, SGSN attach success rate, Call establishment time, APN usage statistics. X Operational excellence Network quality improvement Isolation and correlation of network faults and issues, security detection and prevention mechanisms, traffic planning, prediction of hardware maintenance or the calculation of drop call probability. X X Fullfilment Revenue assurance Tools to efficiently plan, process and predict network growth based on past capacity utilisation, marketing demand and service consumption trends Mix network information with social network information in order to anticipate social events that might require additional resources (traffic forecast) X X Accelerate the provisioning success rate Assurance Operational excellence Network inventory information Operational excellence Predict network resource exhaustion in a timely manner Accurate real-time network information concerning ongoing provisioning requests across different domains. Data from network elements, such as cell towers, routers, media gateways, session controllers, switches, etc. Unify different networks with different resources under the same operational framework Retrieve and correlate information about network capacity management & resource utilisation X X X X X X BIG consortium Page 102 of 162

103 Operational Area Business goal Business Requirement Technical Requirement Technical area Data Acquisition Data Analysis Data Curation Data Storage Data Usage Operational excellence Network optimisation Identification of potential problems gathering information from social media (e.g. many similar tweets coming from the same location) X X X Billing Operational excellence Consolidated convergent rating and billing information Ability to combine CDRs, rated records and information data coming from different sources (mobile and fixed) Table 5. Big Data Requirements for the Resource domain (Telecom sector) Supplier/partner Operational Area Business goal Business Requirement Technical Requirement Technical area Data Acquisition Data Analysis Data Curation Data Storage Data Usage Fulfilment Revenue increase Point of sales location strategy Billing Cost reduction Identify cost savings and improve services. Determination of new points of sales' location based on demographical data Analyse huge amounts of data from call detail records and inter-carrier invoices daily to help communication service providers (CSPs) Table 6. Big Data Requirements for the Supplier/partner domain (Telecom sector) X X BIG consortium Page 103 of 162

104 Social Media Operational Area Business goal Business Requirement Technical Requirement Technical area Data Acquisition Data Analysis Data Curation Data Storage Data Usage General Revenue increase Revenue increase Identify where do telco conversations mostly take place Retrieve and correlate information such as: number of followers, number of people following, influence rate per follower/followed, contact intensity, mood attributes, recommendations to and by others, school, lifestyle, opinions about products owned, opinions about customer service, opinion about telcos, etc. Table 7. Big Data Requirements for the Social Media (Telecom sector) X X X BIG consortium Page 104 of 162

4.4.2 Media and Entertainment

Data Acquisition

According to Capgemini [16] there are three main categories of Big Data acquisition providers: software vendors such as HP and IBM, data providers such as Experian and governments, and finally social networks such as Twitter. Relational databases are relatively well-understood products compared to XML, NoSQL and triple store technologies. The risks of adopting the latter systems need to be understood by business decision-makers so that their potential for innovation can be exploited. Many tools for data acquisition are open source and hence have low setup costs (though the Total Cost of Ownership may not be as economically attractive). However, according to Martin Strohbach of the BIG project [17], the landscape is confusing for anyone unfamiliar with the options. There is therefore a need for the vendor market to mature so that business leaders can more easily make optimal procurement decisions.

Kick-off discussion with TWG needed

Data Analysis

For the media and entertainment sector (and indeed other sectors) the ability to perform meaningful analysis of datasets that may be terabytes in size is one of the most exciting applications of Big Data. Once the data has been acquired, it will likely need some cleaning up, as very few datasets, especially large ones, are without gaps or errors. Tools that can automatically identify data quality problems will be required, as well as tools that allow easy manual update of individual data points. When the UK broadcaster Channel 4 ran a Big Data pilot project, they built their own dashboarding and analysis tools for their data scientists and analysts [5]. Although many Big Data vendors offer analytics platforms, a Wikibon market forecast posited that one of the main blockers in the market was a general lack of tools and frameworks for developers to build custom applications [18].
Nigel Shadbolt of the Open Data Institute (ODI) has spoken of his wish for a "GitHub for open data", a repository of freely available code for analytical software [19].

Predictive analytics is especially important to media organisations in the areas of Customer Relationship Management (CRM), cross-sell and marketing. The ability to apply statistical analysis to large, complex data sets can help, among other things, to build a better understanding of service usage patterns. Data mining and text analysis are well-established disciplines, but the interest in Big Data may drive the requirement for more scalable applications that can be deployed rapidly. Media companies are increasingly monitoring social media to be alerted to breaking news, and to analyse sentiment and trends around popular topics that they want to cover. Real-time analytics tools will thus be a key component of implementing a multi-channel content strategy [20].

Google has released an experimental application called Fusion Tables [21], which allows data to be managed and visualised at a scale beyond what most business users have in the enterprise (e.g. Microsoft Excel). However, it is still only an experiment, and organisations should not rely on Google to keep the service live indefinitely (as witnessed by the controversy over the withdrawal of Google Reader). Google Refine was a tool to improve data quality, and is now being made available as an open source tool called OpenRefine [22]. Media companies will need to find an appropriate balance between using free web-based tools and paying for stable, commercially packaged applications.

Sector-specific and technical landscape discussions with TWG needed

Data Storage

One of the main trends driving the Big Data revolution is the exponential growth in unstructured data. Relational databases have been around for more than thirty years, and consequently are mature technologies that are robust and underpin a large proportion of transactional systems. However, they demand rigid definition of schemas upfront, which are not easily changed (indeed, this is part of what makes them robust). In order to deal with the overload of unstructured and less-structured documents and data, new types of database have emerged, including XML (the market leader being MarkLogic), NoSQL (e.g. MongoDB) and triple stores for semantic applications.

Channel 4 stored 2½ years' worth of data from their Viewer Insight Database using Hadoop implementations in the cloud [5]. The dataset was some 2.9 terabytes in size; manipulating a dataset of this volume with a small team would just not have been feasible until very recently. Businesses may struggle to use traditional data warehouses for intensive processing demands, or even to hold unstructured data in the first place. However, as the technology evolves, and memory continues to fall in price, cloud-based data processing projects could be viable for more than just the most well-resourced media firms.

Security, both in the cloud and in the enterprise, remains a concern. Three South Korean broadcasters were hacked in March 2013 [23]. Due to their prominent profile in society, media companies in developed economies are a high-profile target for cyberattack. The BBC, Reuters and Agence France Presse (AFP) have all had Twitter accounts hacked by pro-Syrian government activists [24]. Services that are run from cloud data centres somewhere in the world are vulnerable to various threats, including the weather.
Amazon Web Services experienced two major outages in 2012, due to thunderstorms in Ohio [25] and server problems in Virginia [26] respectively. Notable casualties included Reddit, Pinterest, Airbnb and Foursquare [27].

According to [28]: "The essential characteristics of big data -- the things that allow it to handle data management and processing requirements that outstrip previous data management systems, such as volume, data velocity, distributed architecture and parallel processing -- are what make securing these systems all the more difficult. The clusters are somewhat open and self-organizing, and they allow users to communicate with multiple data nodes simultaneously."

The European Commission is funding a study into security for cloud service providers [29]. As highlighted in the Industrial background section, tougher data protection legislation in the EU will necessitate effective security management.

Sector-specific discussion with TWG needed

Data Curation

Data may be curated for many purposes, with the primary use case for the media industry being to help create products and services for consumers: producing a newspaper is essentially data curation, as is the process of applying tags to web content items in order to create pages, packages and feeds of information.

With the growth of data (used in the widest sense) on the web has come recognition of the value of human data curators in finding, selecting and disseminating the best resources. Automated technologies such as APIs, aggregators and personalised social media services can help filter the deluge to some extent, but best-of-breed data curation tools should support human intelligence in sorting the signal from the noise.

Data curation tools need to be fully integrated into the content lifecycle. For example, if business users are expected to apply concept tags to content they are creating, the concepts should be available or addable instantly, so as not to slow down the publishing process (particularly important in news). As data curation may be part of many people's jobs across the organisation (e.g. content creators, librarians, metadata managers, semantic developers), the tools need to align with business processes around governance and workflow.

Big data makes it easier for organisations to ingest, process and output data streams from diverse sources. Data curation generally sits somewhere in the middle of those processes (architecturally and temporally). The applications therefore need to scale to the demands of large and/or real-time datasets, and be able to export data in different formats. Modular solutions for data management and curation that integrate with enterprise applications can be built or bought, and need to accommodate the fact that most media companies have legacy systems.

Provenance of information is an important concern for anyone in the media, and data curators who are using others' data to augment what they are producing have to be especially aware of the risk of bad information being published. Big data makes it possible to find and ingest a myriad of data, but the tools that curators use must support good governance.
Fact-checking, automatic checking for data updates from public sources, and review workflows are some of the features needed to enhance trust in the curated data.

Data marketplaces and linked data repositories offer media companies large amounts of free or low-cost datasets that they can leverage to enrich their own data and content. There are private sector services such as Infochimps and the Windows Azure Marketplace, as well as public sector open data sites such as data.gov. Big data curation tools need to be able to access and ingest this data, whether through APIs or files in different formats.

As there are strong overlaps between media and data curation, close collaboration with the TWG is recommended on producing a case study, co-ordinating interviews etc. This section will be updated following an initial interview with Helen Lippell of PA.

Data Usage

The rise of storage technologies such as Hadoop has somewhat dominated the conversation around Big Data. However, it is futile to implement Big Data infrastructure without considering the full end-to-end lifecycle of the data. Data usage may be seen as the final step on this journey. Whether the data is for purely internal use (e.g. product development) or external use (e.g. driving a user-facing service), it needs to enable decision support for the right people, whatever role they hold in the organisation. Big Data systems will be maintained by technical staff, but the outputs need to be made accessible, meaningful and adaptable for non-technical users.

As identified in the Social media analysis application scenario, Big Data has the potential to speed up the innovation cycle by allowing iterative, data-driven product development. Tools that enable flexible querying of huge datasets, e.g. to segment audiences along new dimensions or to programmatically identify audience niches that had not previously been considered, will be essential.
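
As a minimal sketch of the kind of flexible querying described above, the following Python groups viewer interaction records along any chosen combination of dimensions and flags small segments as candidate niches. The record fields (age_band, device, genre), the sample data and the niche threshold are all hypothetical illustrations, not taken from any system discussed in this report; a production tool would presumably run the same kind of grouping over a distributed store rather than an in-memory list.

```python
from collections import Counter

# Hypothetical viewer interaction records; field names are illustrative only.
INTERACTIONS = [
    {"age_band": "16-24", "device": "mobile", "genre": "comedy"},
    {"age_band": "16-24", "device": "mobile", "genre": "comedy"},
    {"age_band": "16-24", "device": "tv", "genre": "news"},
    {"age_band": "25-34", "device": "mobile", "genre": "drama"},
    {"age_band": "25-34", "device": "tablet", "genre": "comedy"},
    {"age_band": "55+", "device": "tv", "genre": "news"},
    {"age_band": "55+", "device": "tv", "genre": "news"},
    {"age_band": "55+", "device": "tablet", "genre": "drama"},
]


def segment(records, dimensions):
    """Count records per segment, where a segment is one combination
    of values for the chosen dimensions."""
    return Counter(tuple(r[d] for d in dimensions) for r in records)


def niches(records, dimensions, max_share):
    """Return the segments whose share of all records is at most max_share."""
    counts = segment(records, dimensions)
    total = sum(counts.values())
    return sorted(seg for seg, n in counts.items() if n / total <= max_share)


# The same data, segmented along different dimensions without code changes.
by_device = segment(INTERACTIONS, ["device"])
by_age_and_genre = segment(INTERACTIONS, ["age_band", "genre"])
small_segments = niches(INTERACTIONS, ["age_band", "genre"], max_share=0.125)
```

Because the dimensions are passed in as a list, an analyst could re-segment the same records along a new axis without touching the code, which is the flexibility this section asks of Big Data usage tools.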

There are many data visualisation tools on the market, such as Tableau, and many have free versions [30]. However, there are fewer options for extremely large datasets that require more sophisticated visualisation than simple or static charting. In the media sector this might be web traffic analysis during a big event to ensure that services don't fail, or real-time analysis of millions of viewer interactions. The tools should also make it straightforward to set up dashboards that show only the most essential information the user needs.

There is a need for better business user tools to manage the algorithms that sit across big datasets. Although the data scientist role has been heavily promoted by the Big Data community, not all data users have a scientific or mathematical background. They need intuitive user interfaces to tweak algorithm parameters (e.g. in recommendation engines) as needed.

Liaison with TWG needed

4.5. Conclusion

Telecom

The telecom sector seems to be convinced of the potential of Big Data technologies. The combined benefits across different telecom domains (marketing and offer management, customer relationship, service deployment and operations, etc.) can be summarised as the achievement of operational excellence for telco players.

However, there are still challenges that need to be addressed before Big Data is generally adopted. Big Data can only work if a business puts a well-defined data strategy in place before it starts collecting and processing information. Investment in technology requires a strategy to use it in line with commercial expectations; otherwise, it is better to keep current systems and procedures. More generally, the required strategy might imply deep changes in business processes that must also be carried out.
The challenge is that operators have simply not taken the time to decide where this strategy should take them, probably because the current economic situation leads to shorter-term decisions. This might be the origin of the Data as a Service trend that some operators are following, which consists of providing companies and public sector organisations with analytical insights that enable these third parties to become more effective. This involves the development of a range of products and services using different data sets, including machine-to-machine data and anonymised and aggregated mobile network customer data.

Media and Entertainment

Europe is behind the US in innovation

It has been suggested that European companies in general have a longer decision-making cycle than their American counterparts [31]. Clearly this could impact the speed of adoption of Big Data technologies to help businesses innovate and grow. Financial services and primary and secondary industries are However, the media and entertainment industry may be better placed than some other sectors in this respect, as it is already having to adapt to seismic changes in technology and customer behaviour. Most well-established media organisations are aware of the importance of the shift to online, and the consequent explosion in data availability, even if they are not necessarily deriving full business value from it yet.

Big Data is not just for big companies

The prevalence of open source tools and cost-effective cloud-based services represents an opportunity for SMEs as well as start-ups to be at the vanguard of using Big Data to generate new revenue streams. Start-ups such as Mastodon C in the UK [32] [33] are working in tandem with the Open Data Institute (ODI) to offer agile Big Data services to clients. Even small websites can gather enough data to derive actionable insights [34]. More work needs to be done by government and the EU to ensure that there are minimal barriers to entry for ambitious, visionary entrepreneurs who want to provide services to media in the Big Data space.

Importance of businesses setting concrete goals

Christer Johnson, IBM's leader for advanced analytics in North America, has given this advice to businesses starting out with big data: "First, decide what problem you want to solve" [35]. Recent research by the US data marketplace firm Infochimps [36] identified that only 45% of Big Data projects are successfully completed. The top three reasons cited were inaccurate scope, technical roadblocks and organisational silos. The top two reasons for analytics projects to fail were lack of staff expertise to understand the data, and lack of business context around the data itself. All of these risks could be at least partly mitigated by setting clear objectives for Big Data projects.

Democratisation of data

Big data is not just about the sheer volumes of data, or the technology stack behind it. The business and human elements are critical to success. In many parts of the media, the barriers between business and technology roles are blurring: journalists are expected to produce multimedia content or manipulate datasets, developers work as core parts of multidisciplinary teams, and product managers need to have broad and deep technical knowledge. Data scientist is a much-discussed role that is core to enabling the exploitation of Big Data.
It has been described by Nigel Shadbolt as being part-storyteller, part graphic designer, part coder [19]. Media organisations facing falling revenue from advertising and print products may need to spread those skills around existing resources, e.g. by helping information professionals and product managers acquire data manipulation skills. The CEO of Infochimps, Jim Kaskade, has said: "put the business users and application developers in touch with the data and make it so simple that they don't need other people to get the job done" [37].

Disintermediation

Perhaps the single biggest change in media in the last 20 years has been the rise of disintermediation, that is, the process whereby the traditional relationship between mass media suppliers and consumers is disrupted by new technologies, cultural shifts etc. News events are now frequently broken on Twitter rather than by press agencies. A single blogger can sometimes contribute as much insight to the national conversation around a story as a traditional newsroom. Content producers are responding to customer conversations and sentiment rather than the other way round.

Progressive media companies are following the lead of scientific data owners to crowdsource data analysis. The SETI project was an early example of using distributed computing power, and current projects such as Galaxy Zoo [38] take it further by asking people to classify galaxies. A lively community has built up around the work. The Guardian crowdsourced the analysis of the parliamentary expenses data [39], using their community to generate richer insight and storytelling than would have been possible on their own. Big data needs to support distributed data architectures that will make it possible for more organisations to benefit from their user communities.

Abbreviations and acronyms

API     Application Programming Interface
AFP     Agence France Presse
BARB    Broadcasters' Audience Research Board
BBC     British Broadcasting Corporation
BSS     Business Support Systems
B2C     Business to consumer
CAGR    Compound Annual Growth Rate
CRM     Customer Relationship Management
DRM     Digital Rights Management
EU      European Union
HDFS    Hadoop Distributed File System
ICO     Information Commissioner's Office
IVR     Interactive Voice Response
LoC     Library of Congress
M2M     Machine to Machine
NoSQL   Type of database that stores and retrieves data more flexibly than a relational database that can be queried using SQL
ODI     Open Data Institute
OSS     Operations Support Systems
OTT     Over the Top
RFID    Radio-frequency identification
SaaS    Software-as-a-service
SME     Small-Medium Enterprise
SMS     Short Message Service
SETI    Search for Extra-Terrestrial Intelligence
TWG     Technical Working Group of the BIG project
UGC     User-generated content
USSD    Unstructured Supplementary Service Data
XML     Extensible Markup Language

4.7. References

Telecom sector

#TCBlog. (2011, 08). Social CRM a fondo: el análisis multisectorial. Retrieved from Territorio Creativo:
Europa Press Releases Rapid. (2009, May). Retrieved from Telecoms: Commission acts on termination rates to boost competition:
EurLex. (2011, 02). Retrieved from COMMUNICATION FROM THE COMMISSION TO THE EUROPEAN PARLIAMENT, THE COUNCIL, THE ECONOMIC AND SOCIAL COMMITTEE AND THE COMMITTEE OF THE REGIONS The open internet and net neutrality in Europe:
European Communications Magazine Q2. (2012, Mar). Retrieved from European Communications Magazine:
Amsterdam, C. N. (n.d.). City Net Amsterdam. Retrieved from
Cisco. (2013, 02). Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update, Retrieved from c html
ComputerWorld. (2012, 10). O2 mobile customer data to be sold to third parties. Retrieved from ComputerWorld UK: mobile-customer-data-to-be-sold-to-third-parties/
DigitalRoute. (2012, May). How big data is challenging mobile service mediation. Retrieved from DigitalRoute:
Dresden University of Technology. (n.d.). Privacy Implications of the Internet of Things.
Economist, T. (2012, 03). Retrieved from Big data. Lessons from the leaders:
etom. (n.d.). TM Forum. Retrieved from Business Process Framework:
European Commission. (2012). Retrieved from How will the EU's reform adapt data protection rules to new technological developments?:
European Communications Magazine. (2012). Big Data survey: new revenue streams top CEM as biggest opportunity. European Communications Magazine.
European Communications Magazine. (2012, 03). European Communications Magazine Q2. Retrieved from European Communications Magazine:
Gartner. (2012, Nov). Press Release. Retrieved from Gartner Says Worldwide Sales of Mobile Phones Declined 3 Percent in Third Quarter of 2012; Smartphone Sales Increased 47 Percent:
Gartner Press Release. (2012, Oct). Gartner Says Big Data Creates Big Jobs: 4.4 Million IT Jobs Globally to Support Big Data By Retrieved from

IBM. (n.d.). Telco Five telling years, four future scenarios. Retrieved from ibm.com/services/us/gbs/bus/html/ibv-telco2015.html?cntxt=a
IDC. (2012). Balancing Business Innovation With IT Cost Control: Big Data Adoption and Opportunity in EMEA. IDC.
Kramer, J. (2010). How to write references and citations. New York: Grorring Ed.
Meissner, D. P. (2011, Feb). Roadmap to Operational Excellence for Next Generation Mobile Networks. Retrieved from socrates.eu/files/workshop2/socrates_final%20workshop_peter%20meissner.pdf
QoSMOS. (n.d.). Retrieved from Radio Access and Spectrum. A white paper on spectrum sharing. October 2012:
SAS. (2012, 04). Cebr Report for SAS. Retrieved from Data equity. Unlocking the value of big data:
Telefónica. (n.d.). Telefónica Dynamic Insights. Retrieved from
Thomas, S. D. (2012, October 1). 3G. Retrieved from Vodafone And O2's Network Sharing Approved:
WEF. (n.d.). World Economic Forum Report Retrieved from
ZTE. (2012, 11). Big Data Brings Opportunities to Telecom Operators. Retrieved from ZTE: 1_ html

Media and entertainment sectors

Brian Solis, The future of marketing starts with publishing
Kathy Klotz-Guest, EC=SC (Every company must be a storytelling company)
McKinsey Global Institute, Big data: The next frontier for innovation, competition and productivity on
[1] Eurostat Information Society
[2] Ofcom, UK Fixed Broadband Map
[3] BARB TV viewing methodology
[4] Victor Luckerson, What the Library of Congress plans to do with all your tweets
[5] Information from Big Data London meeting, 20th February
[6] Paul Mason, Thommo's big themes: 1: Martini media

[7] IFPI Digital Music Report 2013, Engine of a digital world
[8] William Shaw, Wired magazine, Cash machine: Could Wonga transform personal finance?
[9] Jason Fitzpatrick, Lifehacker.com, If you're not paying for it, you're the product
[10] Wikipedia, Facebook privacy concerns
[11] Moritz Jaeger, Zdnet.com, Facebook wins European court battle over right to fake names /
[12] Francoise Gilbert, SearchCloudSecurity.com, The proposed EU data protection regulation and its impact on cloud users (requires free login)
[13] Liana B. Baker and Jim Finkle, Reuters, Sony PlayStation suffers massive data breach
[14] Warwick Ashford, Computerweekly.com, Linkedin data breach costs more than $1m
[15] Gary Flood, Information Week, Sony slapped with $390,000 UK data breach fine
[16] Manuel Sevilla, Capgemini, Big Data vendors and technologies, the list!
[17] Martin Strohbach, BIG project, Notes from interview with Nathan Marz
[18] Jeff Kelly, Wikibon.org, Big Data vendor revenue and market forecast
[19] Information from Nigel Shadbolt talk Open data, what's the use? at the British Library, 19th March
[20] Castleford.com.au, Internal linking lessons from old media
[21] Google Fusion tables
[22] OpenRefine project on Github
[23] Tania Branigan, Guardian.co.uk, South Korea on alert for cyber-attacks after major network goes down
[24] Gordon Macmillan, The Wall blog, Official BBC Weather and Arabia Twitter accounts are hacked by pro-Assad supporters
[25] Timothy Prickett Morgan, The Register, Bad generator and bugs take out Amazon cloud
[26] Timothy Prickett Morgan, The Register, Amazon's Virginia cloud data center knocked out again
[27] Romain Dillet, Techcrunch, Update: Amazon Web Services down in North Virginia Reddit, Pinterest, Airbnb, Foursquare, Minecraft and others affected

[28] Adrian Lane, Dark Reading, Security implications of Big data strategies
[29] Cloud Security Alliance, Open survey: CIRRUS: Towards an EU framework for cloud service provider security
[30] Simon Rogers, Guardian.co.uk, Data visualisation DIY: our top tools
[31] Mac Slocum, O'Reilly Strata, Big data in Europe
[32] Mastodon C
[33] Information from Women in Data talk, 22nd February
[34] Seshu Edala, Forbes, Big Data analytics: not just for big business anymore
[35] Edd Dumbill, O'Reilly Strata, What is Big Data? An introduction to the big data landscape
[36] Infochimps, visual.ly, CIOs & Big Data
[37] Gil Press, Forbes, Infochimps new CEO on the next big data acquisition and getting rid of data scientists
[38] Galaxyzoo.org
[39] John Burn-Murdoch, Guardian.co.uk, Datablog + MPs expenses

5. Retail, Transport, Manufacturing

Retail, Transport, and Manufacturing belong to the rather traditional industries. The increasing digitalization and datafication of these industries already produces big data in terms of volume and velocity. The sector stakeholders are also looking into opportunities to leverage the existing variety of data sources in their business networks.

In this first draft document of sector requisites, we put together findings from research and from the first interviews being conducted with decision makers and experts. In the forthcoming months of the project we will continue with the interviews and stakeholder engagement activities in order to reveal widely agreed requirements of these sectors towards big data. The findings in this first draft version may not be representative in all aspects and may at times reflect the personal opinions of the interviewees. But as decision makers, the interviewees also reflect the current positions in the sector regarding general big data aspects.

Each of the sectors is in itself much more complex and may host a myriad of potential big data application domains. In this first draft we selectively look at the segments of the sectors that have been identified as big data related in the first round of research and interviews. It will be a matter for forthcoming discussions in the sector forums whether we should concentrate on a few application scenarios, and how to choose them, or whether we should cover as much breadth of these sectors as possible.

Introduction

Retail

Retailers are faced with a growing amount of data and the availability of heterogeneous data sources. This can be seen not only as a challenge but also as an opportunity. New marketing strategies and business models, such as electronic and mobile commerce, show the changing requirements and expectations of the new generation of consumers.
For example, stationary retailers with their physical hypermarkets have to rethink their business models to remain competitive and to gain future competitive advantage against online retail models. New technology trends such as the Internet of Things & Services can be used to attract new customers by creating new marketing concepts and business strategies. Most retailers are strengthening their business models towards multi-channel merchandising. Classic metrics such as inventory information (e.g. Stock Keeping Units) or Point of Sale analysis (e.g. which products have been sold and when) still play an important role, but knowledge about customers is becoming more and more important. The potential of tailored and personalized customer communication, so-called Precision Retailing, is one of the hot topics for marketing experts. Identifying and understanding customer needs and behaviour requires collecting, processing and analysing large amounts of data from different sources. In the near future, increased interaction between retailers and consumers will generate Big Data which can be used to provide individual services and recommendations for the new generation of consumers. For a better understanding of their customers, retailers have to collect large amounts of different data sets on an individual level, e.g. to launch personalized loyalty programs.

Transport

The transport sector has many facets and can be analysed along several axes: the mode, i.e. road, rail, air, etc.; the elements comprising the sector, namely infrastructure, vehicles, and operation; and the function of transportation, which is either freight transport or passenger transport (Transport, 2013).

In this first draft version we focus on the multimodal transportation of passengers in an urban setting, since it touches on a wide range of the possible aspects of transportation. Multimodal transportation needs to satisfy constraints such as being comfortable to use and offering the most economical or green way to travel according to personal preferences and location-based options, including, for example, public transportation, rail, car-sharing, or bicycle. The better connected all of the elements are, the more feasible multimodal transport becomes. As with all the other industrial sectors, the interconnectedness is achieved by sensors and devices enabling data exchange, especially of location data.

According to EU Directive 2010/40/EU, urban multimodal transport concepts fall under the category of Intelligent Transport Systems, which the directive defines as "[...] advanced applications which without embodying intelligence as such aim to provide innovative services relating to different modes of transport and traffic management and enable various users to be better informed and make safer, more coordinated and smarter use of transport networks" (Directive 2010/40/EU, 2010).

Efficient multimodal transportation is also a central goal of the EU transportation policy for 2020 (European Commission, 2011). The Commission's vision considers three major transport segments: medium distances, long distances and urban transport. Depending on the size of the area and the components and users involved, urban multimodal transportation is conceivably one of the more complex big data settings in the transportation sector, especially because it involves end users as well as multiple organizations exchanging data across their boundaries.

Manufacturing

In the Manufacturing Sector, the biggest trend is Industry 4.0, also known as Integrated Industry.
Industry 4.0 introduces a wide variety of consequences, including the addition of sensors and actuators to the production equipment and of IDs and memory to the products. As one of the core issues, a new need for data analytics arises. The amount of data not only rises (as in most other sectors), but also changes dramatically in nature. The advent of newly connected and instrumented machinery creates new types of data with a strong requirement for data integration. At the core of Industry 4.0 is a requirement for fully integrated and mapped networking, where there is no (or little) room for unstructured data, one of the key characteristics of Big Data. On the periphery of the current developments in I4.0, the situation evolves more as in other sectors, with new business perspectives forming and bringing similar requirements on Big Data technologies.

Industrial Background

Characteristics of the European Retail, Transport, and Manufacturing Sectors

Retail

The retail sector is a key sector for the European economy: EU retail services contribute €454 billion to the EU value added, which accounts for 4.3% of total EU value added, and employ no less than 18.6 million citizens (EU Commission, 2013).

Transport

The transport sector accounts for more than 5 per cent of the EU gross domestic product and employs around 10 million people as of 2011 (European Commission, 2011). The paramount goal of the EU's current transport policy is to offer high-quality mobility services while using resources more efficiently. The 2001 white paper on European transport policy clearly focused on congestion and its external costs, with a view to more sustainable transportation by 2010 (European Commission, 2001). That focus has since shifted towards resource efficiency of transportation. Resource efficiency in transportation requires better integration and linkage of the different transportation modes through information exchange. Data is to be exchanged among all their elements, such as infrastructure, vehicles and operations, in order to choose the most efficient way to travel for each passenger, since curbing mobility is not an option (European Commission, 2011).

Manufacturing

The current manufacturing sector is characterised by a spectrum spanning from mass production to customised, individual production. Automation is typically found in mass production, although individual production is increasingly supported by adaptable tools and technologies. Mass production is not necessarily automated; in particular, in countries with low labour costs many steps are still performed manually. However, automation technology is ever more widespread and is helping mass production in developing countries to catch up in manufacturing quality, where manufacturing in developed countries is in the lead. Besides quality, manufacturing in developed countries can offer more flexibility and customised products through a better qualified workforce and manufacturing technologies. Developing this aspect further with technology support will be one of the key challenges in the near future.

Continuing automation is changing the classical balance between quality and price in manufacturing. The quality of mass production in low-wage countries is rising. This challenges high-wage countries to develop their strengths in quality and customisation further.
With the on-going integration of value chains in manufacturing, within large multi-national corporations as well as between the many players in some value chains, requirements for cooperation, such as standards and norms, market places etc., become highly important.

A study by General Electric, with data from the World Bank, estimates the size of the market affected by Industry 4.0 as follows: "When traditional industry is combined with the transportation and health services sectors, about 46 percent of the global economy or $32.3 trillion in global output can benefit from the Industrial Internet. As the global economy grows and industry grows, this number will grow as well. By 2025, we estimate that the share of the industrial sector (defined here broadly) will grow to approximately 50 percent of the global economy or $82 trillion of future global output in nominal dollars." (General Electric, 2012)

Industry 4.0

Seen in its historical background, the current developments in Big Data in manufacturing amount to a fourth industrial revolution. The first industrial revolution, starting around 1780, was triggered by the invention of the steam engine, the use of coal as an energy source and the introduction of the first mechanical manufacturing facilities. The second revolution, starting around 1900, was triggered by the introduction of electrical energy and mass production techniques (in particular the conveyor belt), all in large capital goods industries like steel and oil. The third revolution, starting as recently as the 1970s, was triggered by the introduction of electronic systems and computer technologies (the microcontroller), enabling automated manufacturing processes on a global scale. It started a trend towards highly efficient production on a scale never seen before, a trend that has now matured but is still growing.
Multi-national corporations are building highly automated factories on all continents and integrating small and medium enterprises into their supplier networks by expanding ever larger industrial infrastructures with the Internets of Everything (IoE), i.e., the Internet of Things, Cloud Computing, and the Internet of Services. They are creating a direct and (in many cases) real-time connection between the virtual and the physical worlds. Thus the term Cyber-Physical Systems (CPS) is used alongside Industry 4.0 to describe these developments.

Industry 4.0 has a number of aspects specific to manufacturing and production plants, in particular in the area of interfaces to the physical world. On the other hand, the data-related aspects apply similarly to other areas of Big Data. These aspects include the acquisition, storage, curation, analysis, and usage of data as well as corresponding issues like interfaces, visualisation, human assistance systems, integration with business processes and regulatory and legal issues. Industry 4.0 is thus a strictly larger development, including Big Data as a core aspect and extending it into the physical world of products and production.

Within existing manufacturing plants, the core challenges that can be addressed by CPS and in particular Big Data are:

- Vertical integration of production steps, infrastructure, logistics, human resources and human assistance
- Flexible and reconfigurable production
  - "lot size 1": customised product
  - ad-hoc networking
  - modularisation of production chains
  - intelligent modelling and description of production plants
- Efficient and energy-saving production
- Operator qualifications, support systems, and digitising corporate (manufacturing) knowledge
- Horizontal integration

Within the entire manufacturing infrastructure, the core challenges that can be addressed by (and in turn are raised by) CPS and Big Data are:

- Horizontal integration of the value chain
- Adapt and evolve the business processes
- Create new business processes and even business models by including cooperation partners
- Protection of IP
- Standards and norms as the basis for cooperation and integration
- Strategies for developing human resources

Vertical and horizontal integration as described above will, in the manufacturing sector, be centred around the engineering process. The complete life cycle of a product or product family will be driven by integrated engineering, where the planning, production, service and recycling steps produce data and access data.
This creates new requirements on the data models, the connection between physical and virtual world (Internet of Things, i.e., smart products with IDs and object memories), and the interfaces between the smart product and its production system.

Market Impact and Competition

Retail

Concerning new technologies, the retail sector in Europe can be seen as a follower rather than a pioneer, in contrast to other core sectors of the European economy such as manufacturing. Competition is fierce and puts the retail sector under high pressure. A closer look at the stationary retail market, for example, shows that most retailers are focused on just a few countries. This limited expansion strategy is a result of country-specific market conditions and the strong competition in the sector. For example, there are big margin differences between European countries in some product segments, e.g. in the food sector.

The impact of Big Data on retail has increased steadily in recent years. The main advantages are higher efficiency and growing margins. A McKinsey Global Institute report (Manyika, 2011) identified five potential ways to create value from Big Data. They result from an evaluation of the US retail sector, but given the general nature of the findings, they can also be applied to the retail sector in Europe:
1. Big Data can unlock significant value by making information transparent and usable at much higher frequency.
2. As organizations create and store more transactional data in digital form, they can collect more accurate and detailed performance information on everything from product inventories to staff management, and therefore expose variability and boost performance. Leading companies are using data collection and analysis to conduct controlled experiments to make better management decisions; others are using data for basic low-frequency forecasting to high-frequency nowcasting to adjust their business levers just in time.
3. Big Data allows ever-narrower segmentation of customers and therefore much more precisely tailored products or services. When retailers know exactly what their customers are interested in, offers and services can be personalized.
4. Sophisticated analytics can substantially improve decision-making.
5. Big Data can be used to improve the development of the next generation of products and services. For instance, manufacturers are using data obtained from sensors embedded in products to create innovative after-sales service offerings such as proactive maintenance (preventive measures that take place before a failure occurs or is even noticed).
Transport

The transport sector, like most other cyber-physical industries, is only warming up to big data, as a recent Gartner report also shows (Kart, 2012). The big data opportunity heat map used in the report shows transport as one of the cooler regions, alongside retail and utilities. Of course, many transportation companies have been using data and analytics massively for planning and scheduling applications, such as logistics or airplane route planning, as mentioned in the report. This may be one of the reasons why big data has not been as hyped as in other industries. But it must be clear that the volume and frequency of data generated by a continuously growing number of sensors ultimately manifest as big data. The report identifies the most significant and imminent areas for big data impact as:
- Supply chain risk management in freight and logistics for maximizing efficiency
- Airline yield management and revenue optimization through demand prediction
- Maintenance, repair and overhaul applications in air and rail to pre-emptively address issues by monitoring sensor data

A major emerging area is identified as "the perfect trip" (Kart, 2012), i.e. having a 360 degree view of the traveller and their preferences in order to personalize trips through so-called mass customization. This can also be seen as an indicator that urban multimodal transport is a valuable starting point for analysing the sector's big data requirements in the context of this project.

Regarding opening up to the potential efficiency increases through big data, the notion of open data also plays an important role in the market. Open data is the idea that data needs to be open and freely available so that it can be reused to create ecosystems of services around it. The rationale behind open data is that the services are valuable rather than the raw data itself. Especially for urban last mile travel, location-based services built around open mobility data can both manifest and unlock big data opportunities for a wide range of stakeholders.

Manufacturing

In most European countries, the impact of Big Data in the context of Industry 4.0 will be the requirement to keep and extend current market advantages in the areas of high-quality, high-tech and customised products. As automation will help other competitors to raise quality above manual manufacturing levels, the focus will be on high-tech and customised products. High-tech or smart products with individual IDs and object memories (Internet of Things) will be used in integrated environments, e.g., the automobile and its life cycle.

In a changing market, the optimal strategy will be a dual strategy, combining the creation of market leaders with the establishment of leading markets. Market leaders build on top of developed and developing base technologies, i.e., Big Data technologies and other, CPS-related manufacturing technologies. In the manufacturing sector, market leaders are machine and plant engineering and construction companies, manufacturers of automation technology, and corresponding integrators and service companies in the ICT sector. The leading markets in Europe and the world are the production facilities, including their network of suppliers and services, i.e., their entire value chain. Both perspectives apply to the European market and should thus be combined in a strategy for adapting the manufacturing sector to the changes related to Big Data and Industry 4.0.

Stakeholders: Roles and Interest

Retail

For a better understanding of the value of Big Data, we identified the stakeholders within retail companies that are interested in using Big Data, and their purposes:
1. Sales department. Purpose: optimization of pricing, product placement and shelf management
2. Purchasing department. Purpose: efficient supplier negotiations
3. Marketing department. Purpose: efficient customer segmentation, precision retailing, dialogue marketing, context- and situation-aware marketing and personalized recommendation
4. IT department. Purpose: efficient data acquisition, handling and analysis tools
5. Logistics department. Purpose: optimization of logistic processes and inventory management

The departments and business units in retail with the highest value and the greatest challenges concerning Big Data seem to be Marketing and IT. From the retailer's point of view, the interconnection and interchange of heterogeneous data between these departments, and especially its task-based evaluation, seems to be the biggest Big Data challenge.

Transport

As stated before, there are many facets to transport and big data and, as such, many different stakeholders. It is therefore necessary to analyse application domains selectively and to look at the stakeholders and their interests within the application scenarios of each domain. Intelligent Transport Systems are such an application domain. They comprise the linkage of vehicles, infrastructure, transport operators, and transport users, and as such are typical breeding grounds of big data. At the same time, being a system of systems, they involve a wide range of stakeholders with complex network effects and side effects. Figure 33 illustrates this within the system of a sustainable connected city, with the stakeholders for energy-efficient multimodal door-to-door transportation as an example:

Figure 33: Illustration of the stakeholders of a sustainable connected city (Rusitschka, 2010) for energy-efficient multimodal transportation in the last mile

- Transport end users pay for the full costs of transport in exchange for less congestion, less energy waste, more information, better service and more safety. They provide location data and preferences.
- City node operators provide destination-related data and resources such as parking lots with e-car charging stations. These operators, such as hotels, shopping centres, exhibition halls, railways, etc., may be the main drivers behind the perfect trip experience for end users, because they share the end users' interest in arriving easily at the destination.
- Municipality/municipal utility may be the operator of some city nodes such as public transportation stations (metro, bus, airport, etc.). They need to provide a close integration with private transportation through data sharing policies and data exchange. The municipal energy utilities are in charge of providing the energy-efficient multimodal transportation with clean energy from renewable or distributed energy resources because these are more efficient considering the primary energy source/use. The municipality additionally has an interest in satisfying both the end users and the EU policies on transportation and energy efficiency. Key performance indicators within the big data streams are useful for this stakeholder group.
- Energy management service provider offers to manage the entire fleet of an e-car sharing provider.
As an example of a typical new player that big data brings into play, this service provider must retrieve actionable information from location-based data on e-cars being shared and parked throughout the city, on available charging points, and on which cars can be used as batteries to reduce peak demand or to absorb oversupply of local wind and solar energy in-feed.
- Value-added information service providers offer data, combinations of data, or derived information, and services based on this information. Such a service could be an e-car route planner that additionally takes weather data into account to propose parking lots that are not only close to the destination but also offer free solar energy for charging due to oversupply, or that suggests leaving the e-car at home and taking another multimodal alternative due to high energy prices.

5.3. Big Data Application Scenarios

Retail

There are several scenarios for the retail sector in which Big Data plays an important role. The findings described in the following are the result of first interviews conducted with decision makers from the retail sector. Their statements not only reflect personal opinions but also give first evidence of positions representative of the sector, which we will critically review and explore in more detail in forthcoming investigations during this project. Here are some of the findings based on the first interviews:
- Improving Enterprise Resource Planning and management of the product database. Different product information (e.g. ingredients, nutrition information, best before dates, pictures) has to be retrieved and assembled from different and heterogeneous data sources. These unstructured data sets have to be reliable, up to date and trustworthy, to mention just three important values. Especially in the food sector, the required product information is often not provided by manufacturers. This information must fulfil the requirements for multi-channel merchandising and must be up to date at all times. To give an idea of the data dimension, a full-range retailer has information on more than 2 million products stored in its data warehouse.
- Planning of store and shelf location. For stationary retailers, the planning of a new store requires access to different data sets, such as the demographic distribution in the target region and detailed information about potential customers. These parameters have to be taken into account when planning a new store. Within a store, the planning of shelf locations, the so-called floor planning, is based on path analysis and heat maps and requires additional information on customer behaviour. Measuring and analysing these massive amounts of data can also be seen as a Big Data challenge.
- Better customer service and dialogue marketing. The most interesting application scenario with the highest value seems to be collecting comprehensive knowledge about the customers, their behaviour and their expectations in order to set up an ad-hoc and context-aware customer recommendation system that provides smart shopping services in the future. To fulfil this task, retailers need to know who their customers are, not only on a cluster or segmentation level, but also on a personalized and individual level. In addition to classic data acquisition, social platforms provide a novel knowledge base that needs new evaluation techniques. The challenge is to identify, to acquire (with respect to legal restrictions) and to analyse these heterogeneous data sets and, in the end, to semantically interpret the results.

Transport

Transport is a multi-faceted sector with a potentially wide range of big data applications. In this first version of the requisites we look at multimodal transportation specifically. Multimodal transportation planning is business as usual. What makes it require big data technologies now is that it can be personalized based on end user mobility and preference data. Traditional planning normally focuses on a limited set of factors for efficiency evaluation. Many impacts, such as parking costs and travellers' preferences for alternative modes, are often excluded because they have been difficult to quantify. Acquiring and utilizing data from location-based services on smart phones, together with other forms of personalization through integrating social media or crowdsourcing, today enables the real-time quantification of these factors through streams of data.

A very important data source in this personalized multimodal transportation scenario is the smart card used for easy payment (Lathia, 2011). In many cities some sort of smart card system is already being set up for public transportation needs.
In the multimodal setting, it should certainly be integrated with other modes, including car sharing services, rental city bikes, etc. Additionally, this does not necessarily have to be a physical payment card; smart phones should be able to host similar applications for widespread usage.

The mobility data captures geo-location data such as starting points and destinations, as well as travel time, stops, etc. Depending on the metropolitan area and the size of the user base, mobility data can be very challenging with respect to volume and velocity. Depending on the big data application, it is conceivable that mobility data will be integrated with a considerable variety of data sources, such as weather or location-based services.

This data can be used in many end user directed ways, including improving the travel choices of end users for sustainability and giving them personalized and contextualized information during their door-to-door travel. On the other hand, the data can also be used at a secondary level by the municipality to influence the behaviour of citizens and travellers in order to achieve goals of the city, such as less congestion and lower greenhouse gas emissions, through personalized incentives and adaptive pricing schemes based on the actual movement of people in the city.

Manufacturing

Production Plant Planning

With the integration of data and data models in the entire engineering life cycle comes the opportunity to extend integrated approaches to the planning of entire production plants. Existing simulation models must be adapted and evolved to support all steps of planning new production facilities that allow for modularized, flexible and reconfigurable manufacturing processes as supported by cyber-physical systems. Such models and simulation environments will be used beyond the planning process. They must rely on the same data sets and data models that are used in controlling and managing the manufacturing facilities once they are in operation.
Large corporations, e.g., in the automotive sector (BMW), have already started specific efforts to extend their production plant planning approaches and integrate them with existing production simulation tools. Obviously, such approaches will create and must use large data sets. New, smart products and cyber-physical production systems, together with the corresponding wealth of data, make this a typical Big Data scenario.

Predictive Maintenance

Cyber-Physical Production Systems (CPPS) will have digital sensors and actuators that can control and adapt the production process in a flexible manner. The sensors also provide large data sets that can be used to support maintenance of the production systems. Using predictive analysis, expected failure of machine parts can be predicted with greater accuracy and reliability. Current maintenance schedules follow a worst-case scenario and call for scheduled down-time for maintenance, including the exchange of often expensive tools at regular intervals. With on-line supervision of the machine state, such maintenance intervals can be made flexible and extended to match the actual state of the system as opposed to the current worst-case scenario.

The type of Big Data analysis that is applied to supervise the machine state can also be applied to unscheduled maintenance. Some parts are expected to fail, and scheduled maintenance only balances the cost of exchanging a still working tool against the cost of down-time of the manufacturing equipment. Unscheduled maintenance has high costs, as either the down-time causes loss of production or it can only be minimized by special, expensive services. Predictive analysis of sensor data will be able to detect characteristic changes and to predict the point in time when failure is imminent, allowing for an individual, scheduled maintenance that is not prone to the high costs of unscheduled maintenance.
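The idea of predicting the point in time when failure is imminent from sensor data can be illustrated with a minimal sketch. It is not a method described in this report: it assumes, purely for illustration, that a single wear indicator (here hypothetical vibration readings) grows roughly linearly and that a hypothetical failure threshold is known. Real systems would typically use richer machine learning models.

```python
from statistics import mean


def fit_line(times, values):
    """Least-squares fit of value = slope * t + intercept."""
    t_bar, v_bar = mean(times), mean(values)
    denom = sum((t - t_bar) ** 2 for t in times)
    if denom == 0:
        raise ValueError("need at least two distinct time points")
    slope = sum((t - t_bar) * (v - v_bar) for t, v in zip(times, values)) / denom
    return slope, v_bar - slope * t_bar


def predict_failure_time(times, values, failure_threshold):
    """Return the time at which the fitted trend reaches the threshold.

    Returns None when the trend is flat or improving, i.e. no failure
    is predicted from this indicator.
    """
    slope, intercept = fit_line(times, values)
    if slope <= 0:
        return None
    return (failure_threshold - intercept) / slope


# Hypothetical vibration readings (mm/s), one per operating hour.
hours = [0, 1, 2, 3, 4, 5]
vibration = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5]

# Hypothetical threshold at which the part is considered failed.
eta = predict_failure_time(hours, vibration, failure_threshold=7.0)
print(f"Predicted threshold crossing at operating hour {eta:.1f}")
```

Maintenance could then be scheduled shortly before the predicted crossing instead of at a fixed worst-case interval.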

A challenge to data analytics is the collection of sufficiently large data sets for predictive maintenance. Data sets from similar production systems are not necessarily applicable without data transformation. Machine manufacturers will have a high interest in access to the data that their machinery generates in their customers' production plants. This will raise new challenges in the field of data protection and data market places.

Big Data User Interfaces

The potential of Big Data analytics in manufacturing environments, be it the physical production or the extended value chain (business processes), can only be exploited if human operators and managers have efficient access to the data and have efficient tools available to exploit the data. In a production environment, the data tools may only take attention from actual production to the extent that they provide a quantifiable benefit. See, e.g., the use case for predictive maintenance.

In the area of qualification and capturing corporate knowledge, Big Data offers many opportunities for on-the-job qualification with new assistive technology that can guide workers step-by-step through the individualised manufacturing sequences needed for customised production, up to the extreme case of lot size 1. In a digitized environment, such assistive technology can take the current situation and environment into account: machine state, special requirements of the product, individual information about the worker, etc.

Big Data Requirements

In general, some characteristics of Big Data for the retail sector can be summarized as follows: "Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making" (Beyer, 2012).
The following sub-chapters give a first insight into findings based on interviews with decision-makers from IT, controlling and marketing departments.

Data Acquisition

Retail

Based on the first interviews, we identified two types of collected data: the classical data for the accounting and controlling department (detailed sales volume), and data for the marketing department (information about consumers and their behaviour). The acquired data includes all information that is relevant for the business cases. Besides data for the accounting department and product information, the acquisition of information about customers for campaign optimization has increased in the last few years. Data from different sources and decentralized databases are stored in a data warehouse or a central repository. The data of a full-range retailer has an overall volume of more than 2 petabytes.

Transport

Personalization is the holy grail of effective multimodal urban transportation. However, it requires the consent of the end user and the assurance that the data used for personalization is safe. Without the acquisition of mobility data, transportation remains in the realm of business as usual, utilizing less accurate models instead of real-time measurement of movement flows.

In addition, data acquisition from various sources, such as other service providers, is important to ensure the seamless integration of all the modes of transport personally preferred by each end user. Indeed, from the end user perspective the acquisition of data on all the available options needs to be transparent and easy to use. This would translate into sophisticated data sharing policies for all involved parties. Open data can facilitate data sharing and acquisition immensely, especially in the beginning phases of establishing such multimodal transportation ecosystems.

Manufacturing

In Cyber-Physical Systems, sensors are an integrated asset and readily support data acquisition. Accessing sensor data is also not a challenge, as the integration of data in Industry 4.0 is already an on-going development. The challenge in the manufacturing sector will thus be the compatibility of data. This includes data integration within a factory and, more challenging still, data integration throughout the entire value chain across multiple business partners. Within a single company, the immediate use case will be the integration of actual production Big Data with resource data existing in ERP systems. The latter are often not at the level of precision and detail that the former will reach in the near future. Thus the core challenge in data acquisition is metadata (semantic data) and data standards that allow for integration with other parts of the production value chain.

Data Analysis

Retail

For statistical and controlling purposes, standard Business Intelligence software is often used (e.g. Microsoft BI Server). By using cubes, database queries can be made task-oriented by composing rules. This type of analysis is used especially for controlling and business management. For marketing purposes, customer information is analysed by special marketing software for campaign optimization and customer acquisition.
As an example, software by SAS was mentioned, especially the packages Enterprise Miner and Enterprise Guide. These software tools work fine for structured data analytics, but additional unstructured data sources call for new techniques that also have to be easy to use.

Transport

Data analytics needs to be privacy-preserving. The data is mainly well-structured, sometimes time-stamped mobility data. The faster the analytics, the better use can be made of the resulting insight. The trend here is also towards scalable analytics on both historical and real-time streaming mobility data. Additional challenges for data analytics are associated with the analysis of data streams coming from various sources, as is typical in cooperative application scenarios such as personalized multimodal transportation. These sources typically expose unstructured data, for example from location-based social services like Foursquare.

Manufacturing

Two areas for data analysis have been identified in the use cases: plant and production simulation, and failure prediction. In production simulation, the goal of data analysis is efficiency across all resources, such as humans, energy, material, and transport logistics. Visualisation of the data and analysis results will be a key issue. As a core technology for predictive analysis, machine learning techniques, e.g., for failure and maintenance prediction, are of particular importance in the manufacturing sector.

With the advent of smart products and production, data analysis as an extension of classical business intelligence (BI) has an opportunity to connect market data with production data, e.g., to use Big Data based predictions of demand to influence production planning.

Data Storage

Retail

Different database systems are commonly used to store different data sets. Which one is used depends on the processing and analysis steps performed on the data. The data itself is often stored in multi-dimensional cubes instead of traditional relational databases. The advantage of cubes is the rule-based summarization and grouping of dimensions, which makes the processing and analysis of big data sets manageable.

Transport

Storing the possibly huge amounts of data is assumed to be manageable with big data technologies such as Hadoop or commercial offerings. Concerns are raised on how to realize the integration with location-based service data and other data storage systems. Data silos are the typical road blocks to unlocking the potential of big data in this sector as well. However, it is not clear that big data technologies provide a sufficient answer to this.

Manufacturing

There are currently no specific requirements on data storage that differ from other sectors.

Data Curation

Retail

Data curation is handled by the IT department of the retail company, which is mostly located in the headquarters. A couple of years ago, the challenge in handling large data sets was the cost of hard drives. Today, memory processing and data bus speed seem to be the new bottlenecks.

Transport

Crowdsourcing the curation of big data seems to be an interesting option in the application scenario of personalized multimodal urban transportation with location-based social services. Tiramisu (tiramisutransit.com), as an example, is an application that uses crowdsourcing in order to update bus schedules in real-time. Especially during rush hour, bus schedules are unreliable.
The more people use Tiramisu, the more useful and precise the data becomes. Systems offering real-time GPS-based bus tracking are much more costly. Here, crowdsourcing also offers additional information on how full the late bus will already be: data that other services cannot offer easily.

Manufacturing

There are two aspects of Big Data in the manufacturing sector that pertain to data curation. First, since sensor data from manufacturing equipment will necessarily be standardised to a high degree, the need for data curation is shifted towards the standardisation process. Second, the combination of production data with other business and market data (business intelligence) is not different in principle from other industrial sectors.
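The crowdsourced schedule updates described for the Transport sector above can be sketched as follows. This is not how Tiramisu is actually implemented; the report format, the ten-minute window and the use of a median are assumptions made only for illustration. The sketch shows why more rider reports make the estimate more robust: a single outlier report has little effect once several recent reports are available.

```python
from statistics import median


def estimate_delay(reports, now, window_s=600):
    """Estimate the current delay of a bus line from rider reports.

    reports  -- list of (timestamp_s, reported_delay_min) tuples
                (hypothetical report format)
    now      -- current time in seconds
    window_s -- only reports from the last window_s seconds are used

    Returns the median reported delay in minutes, or None when no
    recent report is available.
    """
    recent = [delay for ts, delay in reports if 0 <= now - ts <= window_s]
    if not recent:
        return None
    return median(recent)


# Hypothetical reports: four recent ones (one outlier) and one stale one.
reports = [
    (1000, 4.0),
    (1100, 5.0),
    (1150, 30.0),  # outlier, e.g. a mistaken report
    (1200, 6.0),
    (100, 1.0),    # too old, ignored
]

delay = estimate_delay(reports, now=1300)
print(f"Estimated delay: {delay} minutes")
```

The median is used here instead of the mean so that the outlier report of 30 minutes does not dominate the estimate.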

5.4.5 Data Usage

Retail

Sales volume and receipt analysis is used for reporting and management purposes. Data about consumers and their behaviour is used for marketing optimization, e.g. dialogue marketing and tailored recommendations that should be situation- and context-aware in the future. The need for ad-hoc data analysis to provide situation-aware recommendations and advertising is another point that was mentioned by interview partners.

Transport

Mobility data from personalized multimodal transportation can be used in many ways, such as improving public transportation, offering contextual travel information, and influencing traveller behaviour to achieve environmental goals and transportation policies. Open data could additionally boost business innovation at the intersection of big data and transport.

Manufacturing

The core requirements on specific data usage technologies are similar to other industrial sectors. New interfaces must follow the general goal of simplifying data sets and analysis results, e.g., by visualisation and navigation techniques. Specific to the manufacturing sector are working environments that are constrained by environmental issues like noise, dirt, hands-off situations, etc., and thus have a more extensive need for multimodal interfaces. In a noisy and dirty environment, neither spoken input nor touch input (keyboard or touch-screen) might be viable, calling for gesture-based input. As in all industrial sectors, the availability of individual user data allows for the adaptation of interfaces to individual workers. In the manufacturing sector, these interfaces can be extended and integrated with e-learning components and step-by-step instructions; see the section on Big Data User Interfaces.
Use cases for usage of Big Data concentrate on clusters including: increased efficiency (e.g., in planning and logistics), adaptability (e.g., resilience, slot size 1, customer integrated engineering), adaptive maintenance and an increase of worker qualification Conclusion and Next Steps Retail The task of the retail sector can be summarized as: reorganizing existing Business Intelligence for retail analytics and lift it up to the next level towards a more context-sensitive, consumerand task-oriented analytics and recommendation tool for retailer-consumer dialog marketing. The next steps include interviews with identified stakeholders and decision makers to get a more detailed insight into the requirements and expectations in Big Data. For this purpose the interviews will aim to experts of the business units on the one hand and on the other hand we will conduct interviews with managers to get a more strategic and business point of view. As a result of the interviews, we will show the potential of Big Data in a detailed application scenario that has the highest impact regarding Big Data in the retail sector. Transport This first draft focused on the big data application scenario of personalized multimodal urban transportation within the application domain of Intelligent Transport Systems. There are other areas such as logistics, airport operations which are more established domains with equally BIG consortium Page 127 of 162

128 abundant data volumes and challenging analytical questions to answer beyond organizational boundaries. The next steps would involve identifying interview partners from a wider range of big data application domains within the transport sector. Equally interesting are the overlap scenarios of multiple sectors such as energy and transport in the resource-efficient urban transportation scenario or transport for energy efficient supply chain logistics in retail. Manufacturing The core requirements in the manufacturing sector are the customisation of products and production, the integration of production in the larger product value chain, and the development of smart products. The European manufacturing sector can be both, a market leader using Big Data in the context of Industry 4.0 and a leading market, where manufacturing Big Data is integrated in the larger product value chain and smart products can be put to use Abbreviations and acronyms BI CPS CPPS IoE Business Intelligence Cyber-Physical Systems Cyber-Physical Production Systems Internet of Everything 5.7. References Retail EU Commission. (2013, April 8). The EU Single Market: Retail Services. Retrieved from Beyer, M., Laney, D. (2012, June 21). The Importance of 'Big Data': A Definition. Gartner Report G Manyika J., Chui M., Brown B., Bughin J., dobbs R., Roxburgh C., & Byers A. H. (June 2011). Big data: The next frontier for innovation, competition,and productivity. McKinsey Global Institute. Transport Directive 2010/40/EU of the European Parliament and of the Council on the Framework for the Deployment of Intelligent Transport Systems in the Field of Road Transport and for Interfaces with Other Modes of Transport. (2010). Official Journal of the European Union (Legislative acts) L207/1. Retrieved from European Commission. (2001). White Paper European transport policy for 2010: time to decide. Retrieved from /EU-transportpolicy2010_en.pdf European Commission Director-General for Mobility and Transport. 
(2011). White Paper on Transport: Roadmap to a single European transport area Towards a competitive and resource-efficient transport system. doi: /

Kart L. (2012, July 10). Market Trends: Big Data Opportunities in Vertical Industries. Gartner Market Analysis and Statistics G
Lathia N. (2011, September 21). How Smart is Your Smart Card? [Web log post] Urban Mining. Retrieved from
Rusitschka S., Eger K., & Gerdes C. (2010, November 9). Cloud Computing als Service Plattform für Smart Cities. Proceedings of the VDE Kongress.
Transport. (2013). In Wikipedia, The Free Encyclopedia. Retrieved from

Manufacturing

McKinsey Global Institute (2011). Big data: The next frontier for innovation, competition, and productivity.
Peter C. Evans, Marco Annunziata, Industrial Internet: Pushing the Boundaries of Minds and Machines, General Electric (GE), November 26, 2012
Beyer, M., Laney, D. (2012). The Importance of 'Big Data': A Definition.
EU Commission (2013), (April 8th, 2013).

6. Energy

In this first draft version of the Energy sector's requisites we look at an emergent segment of the Energy sector known as the Smart Grid. The Smart Grid domain currently sees a lot of movement towards more data, with all the accompanying opportunities and challenges. Certainly there are other areas where big data technologies can be, or already are being, put to use. Many operators in today's oil and gas markets receive considerable amounts of real-time monitoring data from the wells. But regarding the mainstreaming of big data applications, the Smart Grid can be regarded as the experimental lab of the Energy market, as this first draft reveals. This may also partly be attributed to end user participation, which is an integral part of the Smart Grid. End user participation can also be considered one of the triggers of the traditional Big Data business of the Web industry.

Big data technologies could have an immense impact on traditional industrial sectors like Energy because there is so much potential for efficiency increases. Many of the processes are manual, based on low-tech communication and on models with incomplete data or knowledge. From today's perspective, however, there are two equally probable futures for the widespread digitization and datafication of the Energy sector. One has experienced an evolutionary adoption of big data technologies that enable more efficient management and analysis of massive amounts of heterogeneously originating data. The other future has resulted from a new breed of technologies that focus on minimizing at least the costly effects of volume, velocity and variety on Big Energy.
The latter outcome may be the result of a truly differentiating factor of the Energy sector, or rather of the cyber-physical industries as opposed to the Web industry: getting the data from the field into the IT backend is a major challenge related to bandwidth, latencies, as well as privacy and confidentiality concerns where end users and companies are involved.

Introduction

Energy stakeholders are aware that there are ever-growing volumes of operational data in their data warehouses and historians. However, the value of data will be realized only once it makes sense for the business (Berst, 2012). Hence, the fusion of Big Data and Smart Grid requires the overlap of new research in Energy, Informatics, and Data Sciences, as Figure 34 attempts to illustrate.

Figure 34 - Big data in the energy sector needs generalists and specialists that grasp the intersection of Energy, Informatics, and Data Science

Big data technologies are web-borne. In order to fully realize their potential in the Energy sector, sector-specific peculiarities have to be taken into account. When applying the technologies from the web to cyber-physical industries, there needs to be a deep understanding of the underlying

system. Eventual consistency, which enables massively scalable data management, may not be tolerable in some applications, for example in the operation of power grids. In the same way that the web industry is in search of data scientists who can make sense of big data, the energy industry will need more specialized data scientists who make sense of mass energy data according to the energy businesses' needs. To address future challenges of the digital post-industrial era within the Energy sector, we need generalists and specialists working at the intersection of Energy, Informatics, and Data Sciences.

Definition of Big Data in the Energy Sector

There are of course many structured and unstructured data sources in the Energy sector, since quite a few of the processes include at least partly digitalized workflows. However, most attention currently goes to the new intelligent devices that can both process and communicate data, such as smart meters at the end users' premises or all sorts of intelligent measuring devices along the power lines. Although the data is structured, and may as such be easier to handle than, for example, free text fields in a technician's maintenance workflow or a call agent's ticket resolution at the customer care centre, the sheer volume and velocity are impressive. The Energy sector is clearly moving towards more ICT-connected power grids. In the future scenarios, price signals, e.g. from markets, will be heterogeneous sources of valuable data, too. Hence, once the volume and velocity of energy data are manageable, the next frontier will be making sense of the vast data from a variety of sources, specifically identifying and delivering the right scope of insight to the right sink at the right time. "There is no value in data unless we use it" seems to be a suitable punch line that most in the Energy sector would nod to.
However, the roadblocks to data usage are equally manifold, as the rest of the sections will sketch.

ICT-connected, so-called smart grids are not the only big data area of the industry. Many operators in today's oil and gas markets receive so much monitoring data from the wells that this big data already starts to interfere with the actual work performance. The myriad of sensors delivering real-time data, combined with the ever-increasing number of wells that an engineer has to manage, results in a working environment where engineers permanently react to problems rather than work on preventing them (Dupre, 2012). Certainly oil and gas is also only one of many markets of the sector that have to handle what the digital era brings with it. However, electricity enterprises generate higher value added than gas enterprises, especially in Germany (Stawinska, 2009). This is why, in this first round of extracting sector requisites, we concentrated on the more prominent and impactful issue of mass data in the electricity market and power grids. Through the on-going stakeholder engagement we aim to deliver a holistic definition of Big Data for the entire Energy sector, or at least to identify commonalities and differentiators across a variety of energy markets.

The axes for the current definition of Big Data in the energy sector can hence be reduced to Volume and Velocity. Operational data from sensors and intelligent devices seems to be the driver behind Big Data applications. Only niche players and applications test the frontier of business value generation through big data analytics using a variety of data sources regardless of data origins (see also Market Impact and Competition).

Benefits, Advantages & Impact

Currently the existence of mass data is not seen as something positive. As will become clearer throughout the sections, there are different views of how useful this big data will be. But there

are too few successes or failures in the industry to be able to say whether data will transform the Energy sector as it did many others, or not. Energy is still a very conservative sector. It is certain that the Energy sector is becoming increasingly digitalized, but there could be other technical solutions for reaching actionable information without having to manage masses of data. Data ownership issues that big data brings with it are already apparent in the pilot projects that experiment with mass-data-generating technology such as smart meters or phasor measurement units. But these technologies are not fully rolled out, so there is little knowledge about what to do with the data, and about what can be done or is allowed by the regulators and end customers, for example. Hence, one cannot yet discuss realized impact and advantages. However, throughout this document advantages and benefits of big data are projected, given that the following preconditions are realized: the digital technology is fully rolled out in the field; the regulations are clear about what is allowed with the data without restricting the "how" with respect to technologies; and the incumbents have enough incentives to also consider new business networks and models alongside their established ones.

The impact of big data technologies in a nascent digitalizing industry such as energy can be manifold. There is immense potential for efficiency increases through more automation based on the actionable information gained from real-time analytics on big data.

Industrial Background

The Energy sector is a multifaceted industry with a long history. In order to provide some background to most of the uses and possible big data applications discussed in this first version of the sector's requisites, let us look at the utilities in the Energy sector. The utilities business has been built around the power network. And the power network has had one comforting assumption: it is in a steady state, i.e.
once everything is set up, power flows according to physical laws in a quasi-steady state, and under normal conditions one does not even need to look at it. And large emergency events that can cause blackouts are highly improbable. Also, for the majority of its history the power industry has been powered by vertically integrated utilities, with everything under one control: the power generation, transmission, distribution and the end users of energy. Such a company could well diversify its portfolio without and work with models rather than with real measurement data.

In Europe, in order to enable better reliability, the different transmission system operators joined forces to build an interconnected network that additionally could ride through disturbances, because more unaffected resources were available within the same network. But this interconnection also brought more complexity, such that transparency and more awareness of the inner workings needed to be established. Currently ENTSO-E is responsible for establishing policies for this interconnected network. Its main responsibility is establishing transparency through data sharing platforms such as and the Data Portal (Entso-e, 2013). Although these are far from real-time data sharing platforms, they probably are precursors of energy big data platforms. Another such body concerned with the governance of data in the interconnected system is

Another layer of complexity has emerged through the deregulation of all functions of an energy company other than transmission and distribution. In Stakeholders: Roles and Interest we also show what this historical change means in terms of the explosion of data and data exchange resulting from both technological advancements and possible business innovation.

6.2.1 Characteristics of the European Energy Sector

Let us take the smart metering domain as an exemplary new and emergent big data application in the Energy sector and look at the characteristics of the different advancements in Europe. Italy and the Nordics have had smart meter rollouts now covering over 94 and 70 percent of end energy users respectively. Enel, the Italian utility, invested in this technology, which delivers real measurements of energy end use, very early in 2006, before any regulations came into place. Sweden, on the other hand, had a successful regulation-driven rollout. Germany, besides Spain and Portugal, is among the countries with the smallest installation of smart meters, at only 1.6 per cent (Sentec, 2011). The IMS Research data indicates that Germany and Poland will have a rollout of less than 30% by 2015 (Sentec, 2012). Although other countries have a higher rate of installations, when asked, these first-mover nations say that Germany is seen as the fastest country. This may be due to the many pilot projects funded in Germany, but also because Germany has intensified the discussion on new technologies, data protection and feasibility more than any other European country (Service Insiders, 2011).

The European Energy sector is currently mainly characterized by deregulation and by the differing implementations of the energy-related laws from the European Commission. It seems that there is a delicate balance in how much regulation is needed to kick-start technologies that would otherwise have taken much longer to achieve market penetration. Also, there is no dependable measurement of successful governmental subsidies, as can be seen from the above example of monetization of energy data. The following section on market impact and competition tries to shed some light on market activity around this data. The main takeaway should be that the European Energy sector is characterized by its lack of focus on startups related to energy data and services.
Across the board, when Energy startups get funded in Europe it is safe to assume they have a hardware-related solution.

Market Impact and Competition

The Energy sector has considerably few concrete Big Data instantiations in the market. The Big Data market in the Energy sector mainly revolves around the end use of energy and related energy data. The data is exposed through digital measuring technologies and increased use of the web as a means of connecting with the energy end users. These are widely referred to as smart grid technologies, and range from home automation gateways, to smart meters, to all sorts of new digital measuring and control units in the field. Whilst there are opinion leaders who connect the dots and argue that smart grid technologies pave the way to Big Data in the Energy sector, the majority of the market is occupied with installing these digital technologies and building the right infrastructures, both in the field and in the enterprises, to cope with the increasing volume and velocity of energy data. Data analytics will then be the next step towards fully realizing the returns on investment in smart grid technologies.

McKinsey analysts state (McKinsey on Smart Grid, 2010) that in the US the energy savings through smart grid technologies could total $130 billion per year. One of the key assumptions is that supply and demand for energy can be better balanced through timely measurements, also called interval data, sampled by smart meters. trend:research, a German market research company, surveyed European countries in 2011 about the market development and potential of smart meters and came to similar conclusions as those about the US market. Additionally, it found that there will be roughly 183 million smart meters in Europe by 2020. This would equal a cumulative market volume of about 60 billion EUR, with France, Germany and Italy having the largest market volumes at around 20 per cent each (Service Insiders, 2011).
The US market, as opposed to the European market, has many new startups around energy data (Fehrenbacher, 2013), which will be discussed here briefly. The discussion also shows that there is some momentum regarding competition, including international companies. In the years 2009 through 2011 the majority of the start-ups and solutions concentrated on energy

data from the end user business. Other areas of the smart grid that generate data at scale are still characterized by fundamental research and/or joint ventures and pilots. And start-ups that concentrate on handling data within the utilities' core business seem to get funded much later after their foundation, but with all the more impact. But as the examples show, the market trend is towards achieving the ability to analyse massive amounts of data from a variety of sources in near real-time. This focus on data analytics of big energy data could be said to have started in early 2012, at least in the US.

Energy Data Start-ups and New Energy Data Solutions by Incumbents

C3 is a four-year-old start-up that has raised 100 million USD. In September 2012 C3 launched a data grid analytics project for PG&E. The project analyses data from roughly 500,000 commercial and industrial buildings owned by the likes of Cisco, Safeway, and Best Buy. In addition, a variety of data sources are tapped into, ranging from data found via Google and energy consumption data from the utility to weather data from weather information companies. The entire project is said to have required C3 to process 28 billion rows of data. The processing, i.e. aggregation, normalization and loading, was executed at a velocity of 500 million records per hour. The data analytics is used to perform energy efficiency audits for all the buildings in the PG&E service area in real-time. The same project for residential customers is launched in February 2013 and probably means even more data (Fehrenbacher, 2013). The company plans to launch another 10 projects. Other customers include Entergy, Northeast Utilities, Constellation Energy, NYSEG, Integrys Energy Group, Southern California Edison, ComEd, Rochester Gas & Electric, DTE Energy, as well as GE and McKinsey.
The article also highlights that the company's board of directors signals support from the regulatory side, as it includes former Secretary of State Condoleezza Rice and former Senator and Secretary of Energy Spencer Abraham. Siebel, the founder of C3, is reported as saying that smart grid analytics is growing at a rate of 24 per cent a year (Fehrenbacher, 2013). Pike Research reports (Navigant Research, 2012) that cumulative spending from 2012 through 2020 will total over 34 billion USD. The report differentiates the key segments as meter analytics, grid analytics, asset analytics, and renewable integration, for business intelligence, operations and customer management. The following are some further examples from these segments.

The start-up company Space-Time Insight provides software that offers real-time visual and temporal analysis of a company's operating status overlaid on satellite images. Space-Time was founded in 2003 and first received funding in 2009 (CrunchBase, 2013). Although the software can certainly be used in any domain that derives value from advanced visualization and real-time analysis, such as the Oil and Gas or Transportation sectors, the company has made a name for itself in the energy sector and raised 14 million USD in Series B funding from energy-focused investors at the end of 2012 (Harris, 2012). Their first customer, the California Independent System Operator (CAISO), tasked them with the visualization of status and events from a power network that manages a power load of more than 286 million MWh of energy per year across more than 25,000 miles of transmission and distribution lines (Harris, 2011). Space-Time delivers data updates within milliseconds, as opposed to traditional Supervisory Control and Data Acquisition (SCADA) systems that asynchronously poll selected data points in the field every 2-4 seconds. Another challenging application, especially in the deregulated European energy market, is that of analysing renewable energy feed-in in real-time.
California has an ambitious goal of supplying 30% of its demand from renewable energy sources such as wind and solar by 2020, which sounds very familiar to European system operators. However, the Californian ISO can monitor the data from about 4,500 connected energy sources to determine which energy source is the most economical at any given location at a given time.

GE is counting on data analytics and visualization to take its software-related annual revenue from 3 billion to 5 billion USD within the near future (Maitland, 2012). The Grid IQ Insight software from GE collects data from a variety of sources, such as meters, grid sensors along power lines and hubs, weather reports, and public social media data of consumers. In that

sense, it is one of the few utility software products that take on the variety challenge of big data and try to correlate, for example, the location data from a tweet sent from a phone to capture a power outage in an area as it happens and people tweet about it. It also demonstrates that even for traditional non-web business processes there can be leverage in tapping into social data sources to improve the service offering to customers, in this case to fix the outage more quickly or to respond to complaint calls from customers in a more knowledgeable way. It also seems rather unconventional in utility IT that the same system that performs analytics on social data handles data from grid nodes such as substations and transformers (Fehrenbacher, 2013); a utility's IT backend is traditionally made up of silos.

GE is investing $1 billion in Industrial Internet applications and products like Grid IQ Insight that have the potential to cut $150 billion in waste across major industries like aviation, oil & gas, healthcare and energy by improving asset management and productivity and lowering maintenance costs, the company reports in a recent publication (GE, 2013). The report further states that the Industrial Internet could add from $10 to $15 trillion to global GDP over the next 20 years. Grid IQ Insight analytical applications are built on top of the C3 platform stack. The supporting big data framework additionally encompasses SQL, Cassandra, Splunk, and Tibco components (Fehrenbacher, 2013). GE's big data software is also part of its efforts to sell technology for the Industrial Internet, i.e. bringing digital technologies to sectors like transportation, aviation, locomotives, power generation, oil and gas development, and other industrial processes.
Through the acquisition of eMeter, Siemens also shows increased commitment to solving the big data deluge that its customers will be facing once smart meter rollouts are actively delivering data at different intervals, from minutes to hours to once a month (LaMonica, 2011). eMeter's software is designed to collect data from two-way meters and also from multi-utility meters that can measure gas, water, and electricity, to process it, and to feed it into the multiple applications in the utility backend. The eMeter solution EnergyIP ships with an Oracle database on an n-tier distributed architecture built around a fully embedded Tibco message bus. eMeter also offers two different analytics packages, one called EnergyIP Analytics Foundation built on an Oracle database, and one built on the IBM Netezza analytical appliance for clients with very large data volumes or intensive analytical needs, as stated in a recent Gartner Magic Quadrant for Meter Data Management Software (Sumic, 2013). The Magic Quadrant also mentions that eMeter has 42 installations in production with more than 23 million meters under contract. It has signed 14 new customers in the past 12 months. The largest installation site is the Independent Electricity System Operator (IESO) in the province of Ontario, Canada, with 4.5 million hourly meters. eMeter manages smart meter deployments for North American customers including Alliant Energy, the Texas utilities CenterPoint Energy and Bluebonnet Electric Cooperative, and Canada's Toronto Hydro, as well as European customers like Vattenfall in Finland and Sweden. eMeter's MDM software competes with analytical software from Oracle and SAP, as shown in the excerpt from the Gartner Magic Quadrant:

Figure 35 - Gartner Magic Quadrant for Meter Data Management Software (Sumic, 2013)

Early in 2011 eMeter announced its partnership with Verizon to manage meter data from the cloud (eMeter, 2011). But eMeter isn't the only meter data management vendor tackling these big data challenges with big data technologies and practices. Itron has a big project, called Active Smart Grid Analytics, with IBM, SAP and Teradata to manage smart meter data for Southern California Edison. This is a joint venture of competitors, typical for early-stage technology explorations. The multi-vendor solution consists of a data warehouse system from Teradata, a data synchronization product from IBM called Change Data Capture (CDC), SAP Business Objects, and Itron Enterprise Edition (IEE) Meter Data Management (MDM) (Geschickter, 2011). SAP Smart Meter Analytics is one of the latest applications that leverage SAP's in-memory computing appliance HANA to take the mass of data coming in from meters, SCADA and other sources and process it in real time (SAP, 2013).

Another utility IT giant is OSIsoft. 65 per cent of Global 500 process and manufacturing companies use the OSIsoft PI System. OSIsoft raised $135 million from TCV and Kleiner Perkins in the first quarter of 2012 (John, 2011). It seems they will be expanding globally and increasingly into application domains based on energy efficiency.

Schneider Electric bought Telvent, an IT and industrial automation company, for 2 billion USD. In 2012 they announced a first customer for their meter data management software suite called Conductor. Its main feature is the analysis of smart meter data as it streams in at 15-minute intervals. This first installation, with the Canadian utility Entegrus, is supposed to host 50,000 smart meters. Telvent has partnered with OSIsoft to bring the near real-time meter data analytics feature and automation to Conductor (John, 2012).
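To give a rough feeling for the volumes involved, the following sketch counts raw readings per day for the two installations mentioned above: 50,000 meters sampled at 15-minute intervals and 4.5 million hourly meters. The helper name readings_per_day is hypothetical, and the count assumes exactly one reading per meter per interval; it says nothing about storage size or message overhead.

```python
def readings_per_day(meters: int, interval_minutes: int) -> int:
    """Raw readings per day, assuming one reading per meter per interval."""
    intervals_per_day = (24 * 60) // interval_minutes
    return meters * intervals_per_day


# 50,000 meters streaming at 15-minute intervals (figures from the text)
entegrus_daily = readings_per_day(50_000, 15)

# 4.5 million meters read hourly (figures from the text)
ieso_daily = readings_per_day(4_500_000, 60)

print(entegrus_daily, ieso_daily)
```

Under these assumptions the smaller installation already produces several million readings per day, which illustrates why interval length matters as much as the number of meters.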
Telvent's take on meter data analytics shows its unconventional view as opposed to the incumbents: the analytics can be used not only for meter-to-bill applications but also by many other departments within the utility, e.g.

streaming transformer status data can be analyzed to assist field crews, or volt/var optimizations can be automatically corrected according to meter power quality readings.

Yet another take on sharing the market pie can be found in (Sridharan, 2013): outsourcing companies could be an option for the many utility IT departments that are resource-constrained and hence sceptical towards the new IT paradigms needed to manage and analyse data at big scale. The increasing mainstreaming of public and private clouds enables outsourcing companies to ease the utilities' pain points. At least this could be a transitory solution for some utilities, since even if they recognize the value of big data, they may not be able to build the internal resources needed fast enough; this concerns not only technology but also data scientists, as will be discussed in Subsection Challenges that need to be addressed.

Stakeholders: Roles and Interest

Energy is a complex business. Vertically integrated utilities were able to manage the complexities better. However, this was not always efficient in a free market sense. Deregulation brings about more business opportunities around energy data, as Figure 36 shows. The stakeholders in this sketch are those of the smallest cell of the energy system: the stakeholders around the distribution network. The transmission level and interconnections of stakeholders will be discussed subsequently.

Figure 36 - Through deregulation and business innovation the energy data stakeholder group becomes larger, which requires sophisticated data sharing policies or completely new technologies. Source: German BMWi Project E-DeMa

Depending on the degree of liberalization there are different roles and interests:

Consumer: Traditionally consumers are passive and modelled stochastically. They can change their Energy Supplier, but they do not really profit from energy usage data.
Prosumer: An end user with either energy feed-in, flexible loads that can be utilized for demand side management, or both. The Prosumer not only passively consumes energy, but can also provide positive energy reserve through feed-in or adaptation of consumption, either when needed or

within a variable energy tariff. To date there are very few prosumers. The most common prosumer is the end user with energy feed-in. But because the energy feed-in is simply subsidized, there is no real incentive to contribute to the well-being of the market. Demand side management with energy data usage is only interesting if there are tariffs that encourage the flexibilisation of demand.

Metering Service Provider: The metering service provider offers the service of measuring and collecting the energy usage and/or feed-in data. The data is presented to the end user, and to any service provider when contractually needed or upon the request of the consumer/prosumer. Today, for example, there is almost no market pull for the metering service from the end user. So this role best teams up with the energy supplier, who can use the data to realize better customer engagement.

Distribution System Operator: The distribution system operator enables the connection of end users to the power network. But beyond that, in a deregulated market there is not much value for the distribution system operator in actively managing end usage data, because for the operation of the network the operator is interested in aggregates of data, which could also be collected within network nodes. So the interest in end user energy data is low, but if the data is collected and aggregated, the metering service provider could provide some services to the distribution system operator.

Energy Suppliers: Energy suppliers buy energy in wholesale markets and sell it to end users. So there are quite a few interesting applications of higher-resolution energy data here, e.g. more efficient purchasing of energy. But currently the suppliers are bound to buy according to standardized profiles, so they would have no gains even if they had a very good view of their energy usage portfolio.

Energy Service Provider: Energy Service Providers can offer anything that brings benefit based on energy data, e.g. tariff advisory services, energy efficiency services etc.
Energy Aggregator: This is a new role, which can prevail if there are more and more prosumers that are not being subsidized but need to be managed efficiently, because the individual inputs of prosumers are so small that by themselves they could not participate directly in the energy markets. An aggregator could build bigger portfolios and manage them efficiently. Regarding industrial demand side management, however, there are already successful actors. Entelios is a German solution provider for aggregating flexible loads, generation, and storage of industrial facilities and bringing them to market.

On the transmission level, or rather within the entire interconnected power system, there are a few other important roles that are interested in the efficient and effective use of big energy data and other related data. Consider Figure 37, for example, which shows the interconnected power system that is divided both geographically and organizationally. On national levels there are the transmission system operators that balance among each other and with the underlying distribution network operators. On a pan-European level the transmission system operators manage the system of tie lines. ENTSO-E is the body of European transmission network operators. European-wide planning and operation roles are assigned to ENTSO-E, and to fulfil the requirements that such a responsibility brings with it, ENTSO-E is in great need of better data exchange platforms and practices. The entso-e.net is the transparency platform that has been set up for this purpose. Additionally, Coreso was formed to efficiently manage the electricity system in Western Europe through better coordination and organisational structures at this scale. Its foremost tool for solving this complex task is also the improvement of cross-national, cross-organizational data flows. Data flows include, but are not limited to, data about renewable energy feed-in from solar and wind, on- and offshore, large and midscale.
They also include weather data and data from the increasingly active distribution networks with prosumers, demand side management, and electric cars, which introduce a completely new kind of load profile, namely a mobile one.

Figure 37 - The interconnected pan-European power network is a complex system that could profit greatly through real-time on-demand situational awareness using big data
Available Data Sources
The Energy sector is built upon the power network, which consists of high-investment equipment such as transformers and power plants. This is why there have always been a lot of measurements and other data sources for understanding the operation and maintenance of these valuable assets. But these data sources will not remain static. A few years ago, smart meter data did not exist. Today there is also much better sensor data for monitoring the assets. In addition, the increasing adoption of the web as a communication tool by the sector's end customers means that utilities have to ask themselves whether they are aware of all the data sources at their fingertips.
Structured data sources from within the Energy business include:
Smart meters can measure usage and feed-in of energy, including power, gas, water, heating and cooling. They can measure the units down to the minute or event-based. Technically a smart meter could also measure power quality parameters of the power network.
Phasor Measurement Units measure all required power network parameters in real time and GPS-synchronized, in order to provide a dynamic view of large-scale power networks over a wide area. The data is of extremely high resolution, reaching from samples per second. Some devices even sample at higher rates.
SCADA (operational) data includes almost all the data that phasor measurement units also collect. However, there is a fundamental difference in the usage scenarios as well as in the lower frequency with which this data is collected, namely only every 2-4 seconds. Additionally, this type of data is not time-synchronized, which means some preprocessing and ordering is needed before the data can be utilized.
Non-operational data, such as that from digital fault recorders, is so highly resolved that recording is only triggered when a fault happens. The data is then kept in the substations and only retrieved by technicians for post-mortem analysis of the fault. This is mainly because the communication infrastructure is not sized to handle bursts of output data sampled at 720 Hz and 8 kHz.
Newer devices that are being researched and commercialized include, for example, the Digitizer. The Digitizer takes sample measurements of current and voltage at 25,000 times

per second and applies real-time calculations to derive further parameters that relate to power quality (National Physical Laboratory, 2013).
This ever-advancing measurement and sensing technology is at the heart of the Energy sector and will continue to supply the utility control centres with the following information: power factor, power quality, phase angles indicating increasing load on a power line or transformer, equipment health and capacity, meter tampering, intrusion, fault location, voltage profiles, outages, energy usage and forecasting, etc.
Dispersed data sources with uses in the Energy business include:
Building energy management software, which enables building owners to purchase and use energy more accurately. The same data can also be used to better size the power network or the purchases on the wholesale market.
Home automation data, such as data from smart thermostats. EnergyHub, for example, mined interesting statistics on energy behaviour using its connected thermostat, such as that residents in states with cooler average temperatures set their thermostats lower than those living in warmer states (EnergyHub, 2012).
Electric cars, which will most certainly use communication to increase charging efficiency and to enable other services needed to increase the comfort of these cars. E-car roaming is assumed to be another application that cannot be realized without vast data sharing among many stakeholders.
Most prominently, weather data: since energy consumption and, if renewables are involved, also energy generation are highly weather dependent, this data source is useful for all stakeholders. Even distribution and transmission network operators use weather data, for example for fault detection and location or for calculating the thermal loading of lines.
Mining weather data for location-based aspects and forecasting heating- or cooling-heavy days are among the most prominent uses.
Drivers and Constraints
Utilities are undergoing fundamental changes to their operations as they respond to new regulations regarding renewables integration and variable time-of-use pricing for consumers, as well as to the complexities that new competition brings with it through deregulation of the market, and to the new network neutrality challenge. Power generation, for example, is an immensely capital-intensive industry, which at the same time must become greener, more reliable, and more cost-effective. What sound like constraints, and are constraints of the energy industry, are the drivers of big data technologies in this industry. The utilities' constraints have been consuming their profit or efficiency margins, such that they need to look for opportunities to increase efficiency within generation, transmission and distribution without resorting to the traditional way, which was literally to throw more hardware, e.g. transmission lines, at the problem. These companies will have to invest in how best to harness existing resources, which immediately puts them on the path to investments in advanced IT infrastructure and data analytics for more real-time situational awareness and automation. The low-hanging fruit is in the most underdeveloped areas of the utility business, which have been residential customers and distribution networks. There is a lot of return here in understanding smart meter data and mining for intelligence by relating disparate sets of data through scalable analytics (Sridharan, 2013). But the potential lies within the entire system of power generation, transmission, distribution, and consumption. Greentech Media has an interesting take on the big data drivers for industry giants like Siemens and its competitors (John, 2011).
The rationale for why big data software will be important for these incumbents is that they can only build so much hardware for the smart grid. In a sector that is tied directly to GDP growth and where regulations are additionally driving fundamental

change in the operations of power networks, it is a smart move to also concentrate on the IT infrastructure to control, analyze and optimize all the smart grid hardware. This would realize continuous revenues from the hardware investment for both the energy hardware vendors and the energy companies. Regulation has a quite special role when it comes to drivers and constraints: it is both. The deregulation of the market and the subsidies that help future-proof technologies gain widespread adoption faster are clearly also drivers of the big data that results from the deployment of such technology. But in some countries there are still a few contradictory regulations which hamper the realization of business models using these new technologies, and hence putting big data to use. The following sections consider this among the challenges to be addressed.
Challenges that need to be addressed
Traditionally the IT departments of utilities have been taking care of their traditional business very effectively, which means that each of these backend IT applications is tailored exactly to what it needs to do for very effective data analysis, simulation and capacity planning. So the IT landscape has become very heterogeneous over the years. At the same time each application's function is very limited, since the applications are custom coded to consume data in standardized forms, apply standard algorithms and present outcomes in standard predefined reports (Sridharan, 2013). The new paradigm that the utilities are faced with leaves them with huge volumes of data from a variety of sources such as meters, sensors, energy efficiency programs and customer portals. These data reside in several system-of-record archives that scale to trillions of records representing tens of petabytes of data that require further processing to extract information (Sridharan, 2013). An entirely new approach to data analytics is needed to get that information: a big data approach.
Mandating fixed schemas and models, and then throwing away the rest of the data that doesn't fit into what utilities believe reflects reality, does not offer the elasticity they need to cope with the ever faster changing requirements the energy sector is facing as a whole. However, these requirements open up a completely new take on the business, in which the competitive edge traditionally lay in building power plants and transmission lines and in planning and operating them according to fixed predefined rules. If reality did not fit the rules, it was not the models but reality that was changed: e.g. if there is too much wind, the turbine is tripped so as not to disturb the planned operations. This way of thinking pervades the entire industry. It is already being challenged by the free market and regulation, but before a big data mindset can be established, the industry needs to see and be convinced of the effectiveness of big data technology in its realm. It does not help that old technology is rebranded as big data during this time of hype: the Business Spectator survey reveals that the energy sector does not consider its current Big Data tools highly effective. Less than eight per cent of C-suite respondents from the energy sector gave the top rating to the tools they are currently using (Business Spectator, 2012). This clearly contradicts the fact that the vast majority of utilities have not yet changed anything in the IT installations that were in place before the data deluge. As reasoned in the subsection Market Impact and Competition, there are fast movers and first success stories, and none of the solutions are conventional. This contradiction also makes clear that the energy industry needs clarification, best practices and more open research to answer the stakeholders' questions regarding their new business environment with big data.
The idea of a data hub or bus, for example, is still new to the industry; many utilities try to tackle the problem of mass data management and analysis of end user consumption data with their traditional billing systems, which were sized to handle yearly reads and billing. The C-suite needs better advice when it comes to differentiating big data technologies from current IT solutions. This is presumably a phenomenon that comes with the big data hype and is probably not specific to the energy sector.

Meter data management has such a broad definition that in practice it can be anything from a meter-to-bill platform to complex data buses that link multiple core utility applications. The main challenge in the IT backend is to get all the different software working together. Data flows need to be designed efficiently. This means that all the involved departments need to communicate and formulate their requirements regarding the use of energy usage data. But oftentimes the problems already start with getting accurate and valid data from the field in the first place (John, 2012). IT communication failures are the norm in the field. The systems need to cope with dropped or delayed data, whether due to software or component failures or to power network problems. To illustrate the impact of data loss in a big data setting, consider the example of a utility like Entegrus: losing only 5 percent of a month's worth of smart meter reads from the 50,000 meters installed in the field adds up to 1.8 million missed reads per month (John, 2012). These readings are needed for accurate billing of customers. Communication problems can be remedied using the newest network reliability techniques. But analytics within the meter data management system could also be powerful. The challenge here is that the meter data analytics must be able to distinguish whether data is implausible because of errors or because the usage pattern really changed drastically for other reasons, such as price signals or the end user being on vacation.
That brings us to the next and maybe biggest challenge: privacy-preserving energy data processing. Through anonymisation, data can be used without the end user having to be concerned about individual profiling. However, with a variety of data sources and advanced pattern recognition, anonymisation could also be reversed.
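As a worked check of the Entegrus data-loss figure quoted above, the following minimal Python sketch derives the read frequency that the three published numbers imply. The 30-day month is an assumption introduced here only to express the result per day; it is not stated in the source.

```python
# Figures quoted in the text (John, 2012).
meters = 50_000             # smart meters installed in the field
loss_percent = 5            # share of monthly reads that is lost
missed_reads = 1_800_000    # missed reads per month

# Number of "meter-months" of reads that the loss corresponds to.
lost_meter_months = meters * loss_percent / 100

# Reads per meter per month implied by the quoted figures.
reads_per_meter_month = missed_reads / lost_meter_months

# Assumption (not from the source): a 30-day month.
days_per_month = 30
reads_per_meter_day = reads_per_meter_month / days_per_month

print(reads_per_meter_month)  # 720.0
print(reads_per_meter_day)    # 24.0
```

Under that assumption the quoted figures are consistent with one read per meter per hour, which already gives a sense of how far interval metering is from the yearly reads that billing systems were sized for.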
The challenge of privacy-preserving algorithms and analytics needs to be tackled, and the results need to be communicated to end users in an easily understandable yet convincing way. If end users are afraid of sharing their energy usage data with service providers, then there is no big data market.
All these technical challenges also have non-technical aspects: the previously clear separation of responsibilities has resulted in silos not only of data and applications but also of mentalities and processes. The utilities need new ways of thinking. Oracle's "Big Data, Bigger Opportunity" study surveyed executives at North American utilities with smart meters (Vespi & Hazen, 2012):
The utilities surveyed report the lack of talent as one of the main problems.
Only 46% of utilities have a meter data management system, which means the rest are trying to handle their first attempt at big data without any appropriate tooling.
At 60% of the surveyed utilities, the meter data is owned by the metering department, which makes exchange across departments more difficult.
Another interesting challenge for solution providers is to understand that differently regulated markets will have different chains and networks of value generation. They need to adapt their offerings. For example, the main benefit of meter data in an integrated utility is realized within the distribution business. In a deregulated market, however, the distribution network operator does not need to know the exact energy usage of its entire customer base. Instead, energy retailers will profit faster from this better understanding of the customer; as a side effect they could offer a better understanding of demand on the distribution network, aggregated as appropriate.
This is a typical trait of big data economics: data originating in one domain can bring most of its business value in another domain, with side effects that can additionally be harnessed.
Role of regulation and legislation
Regulation in the Energy sector is a hot topic when it comes to the analysis of the big data economy. The Energy Acts, both the German EnWG and the American EPAct, encourage the use of advanced technologies that yield more data, and hence information and knowledge, about the use of energy, such that it can be used more efficiently. But it is very difficult for a regulatory body to find the right balance between regulating and setting the right incentives. The following are a couple of international examples.

In Germany the distribution network operator is regulated and the metering point provider is liberalized; in many cases, especially at municipal utilities, they are still the same company. The regulated distribution system operator is under incentive regulation and hence has no incentive to invest in smart metering technology that delivers real-time usage data. Since the beginning of 2011 energy retailers have been bound to offer variable tariffs that make use of the end users' knowledge about their energy usage or of some sort of home energy automation system. However, the energy retailers have no incentive to offer such variable tariffs because they are bound to buy electricity according to standardized load profiles. So they cannot make any use of the efficiency increase; in the worst case they lose if demand varies too far from the standard because users change their actions based on their knowledge about energy usage and costs. The German Weights and Verification Act prohibits the analysis and aggregation of data for variable tariff design and implementation in the backend using meter data management software. Instead the meter needs to have verified time-of-use slots in which the energy usage is counted. Implementing a new variable tariff or moving from one tariff to another would mean changing the meter. This last case makes it obvious that the energy industry needs an update and compatibility check of all its laws to establish whether they are suitable for the ongoing digitization and the upcoming datafication of the energy sector.
The UK is another deregulated market, where it has been stated that the rapid and coordinated rollout of smart meters is not possible in as short a time as aimed for: smart meter deployment in 65% of UK homes by 2015 (Sentec, 2012). The UK may create a data communications company that centralizes all the consumption data from the country's energy meters (Energy Efficiency News, 2012).
This company's most important tasks will be to engage end users and convince them that their data is safe and that smart meters make sense. The government itself seems convinced that smart meters will accumulate 7.2 billion GBP in benefits over the next 20 years through accurate billing and efficient energy use (Energy Efficiency News, 2012).
The US has had yet another take on regulation, which is also rather a matter of trial and error in finding the right balance: in 2012 California's Public Utility Commission instructed its investor-owned utilities on how to provide access to consumer data, i.e. what data to show, how and when. Utilities were also to pass the data to other companies if users wanted, e.g. in order to receive energy efficiency related services (Berst, 2011). However, after a lot of consumers protested that they had not asked for this favour, the Commission changed course and designed the opt-out option, under which customers pay $75 per person, plus $10 per month, to keep their meters (Marcacci, 2012). Alongside the opt-out option, however, came the joint adoption of the Privacy by Design code of conduct for all smart meter deployments (Privacy by Design, 2011), which is really how regulation should be: constructive and a joint effort by experts, departmental policy makers, and end users.
Big Data Application Scenarios
The following big data application scenarios in Energy are mainly concerned with the flexibilisation of the power grid for increasing energy efficiency through energy automation. Energy Automation, like Industrial Automation for Manufacturing, is probably one of the most mature and at the same time economically most impactful areas for kick-starting the Big Data Economy in these rather traditional domains.
Another motivation for selecting scenarios from these sub-domains of the Energy sector is that they require Europe-wide regulatory attention, since they also concern multiple stakeholders and, most importantly, the end users, including commercial and industrial users of energy, who could additionally profit from the efficiency increase through Big Data applications.

6.3.1 Intelligent On Demand Reconfiguration of Power Networks
Application Scenario: On Demand Reconfiguration of Power Networks
Description: Power networks today are planned well ahead of their installation and operated according to these plans. Especially with the increasing feed-in of intermittent power supply from renewable energy sources, the conservatively planned transmission links, for example, increasingly encounter congestion, although there is still 40-60% of untapped security margin capacity.
Example Use Case: In this scenario, digital network assets such as tap changers, circuit breakers etc. are operated based on real-time knowledge about the actual contingencies and load conditions of each transmission and distribution line. Such on-demand reconfiguration would ease line overloads due to sudden unexpected intermittency of wind and solar power. This would help network operators and renewable energy power plant operators to maximize their resources. The network operator can extend the planning horizon for new power lines, which involve a very lengthy approval process. Plant operators can realize profits because they can operate their plants according to natural availability. Today inefficiencies exist, for example because network operators have to pay a lot for increasing power reserves that can absorb the sudden interruption of feed-in due to temporary changes in wind. The high reserve costs are passed on to end users, so power end users would also profit from this more efficient use of the network resources actually owned rather than resources that have to be bought at high cost. In comparison: 1 million USD worth of digital equipment and software needed to increase the elasticity of the physical power network can ease congestion as much as the investment into a transmission line worth million USD (Tait, 2011).
The near real-time computation of which lines are to be switched off or back into operation to remedy a given congestion requires fast analytics of very high resolution power network condition data. Through parallelization and in-network processing, the required performance may be achieved to return a remedial action, or the action that is economically most feasible in a given situation, in near real time (Hedman, Ferris, O'Neill, Fisher, & Oren, 2010).
High-resolution data from the field is managed efficiently. Advanced computation on the streaming data about power network element status and power dynamics reveals events, e.g. increased loading of a line due to the failure of another line, as they happen. Advanced analytics also suggests which line(s) to switch (back) on or off in order to remedy the contingency. Conventional power reserves are not needed. And since the situation is alleviated fast enough, emergency actions such as load shedding are also avoided.
User Value: User Impact of Application Scenario: high, because today the economically high costs of blackouts, renewable energy cut-off due to contingencies, etc. are not even properly accounted for. But the industry is well aware of them. End user costs that can be saved off energy costs through better embedding of the renewable energy sources are rather easy to quantify and are also considerably high. Operating transfer capability on a transmission and distribution path could easily quadruple with big data insight and according

action (Tait, 2011). Maturity of Application Scenario: implemented manually and only for emergency situations (functionality available but without BIG Data). The next step would be to have a simulation environment for this scenario in order to gain the users' trust in big data technologies. The prototype should be tested thoroughly in this simulation environment, which reveals the interdependencies between power networks, data/knowledge about the actual situation of the network, and the effectiveness of the automatically identified actions. Only then can a buy-in from industry be expected, along with stepwise deployments into real-world prototypes and applications. Financial Impact: TBA
Prerequisites: Alignment with existing IT capabilities; Data Digitalization and GPS synchronization in wide area; Data Quality; Advanced Data Analytics; Data Security and Confidentiality Requirements; System Incentives in deregulated Energy Markets
Data Sources: Phasor Measurement Data, Digital Fault Recorder Data; Topology Data; Geospatial Data; Network Element Status Data; Weather Data
Application Domain: Energy Markets; Renewable Energy Integration; Power Network Operation, Protection & Control; Substation Automation; Power Network Planning; Smart (Distribution) Grid
Type of Analytics: Basic Analytics (e.g. monitoring, reporting, statistics); Mature Analytics (e.g. data mining, event/fault classification); Advanced Analytics (e.g. prediction, remedial action identification)
Required Big Data Technology: For example, considering only the PMUs from all possible other time-synchronized data sources such as DFRs, relays, SCADA, simulators and other sensors: 50 million data samples/day/device; 100s-1000s of devices. Streaming; event-based. Data Acquisition: wide-area latencies, locality, timeliness; cross-TSO/-DSO. Data Analytics: both basic (e.g. bad data, missing data, aggregation) and complex and advanced analytics (e.g. phase angle differences in wide area, identification of sequence of remedial actions including alternatives, ...). Data Storage: both online real-time and offline historical data access. Data Usage: cross-organizational synergies vs. confidentiality
Sources: Interviews, Research (e.g. (Fehrenbacher, 2013), (Robust Adaptive Topology Control, 2012))
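To give a feel for the PMU volumes named under Required Big Data Technology, the following Python sketch scales the quoted 50 million samples per day per device to fleets of 100 and 1,000 devices, the bounds of the "100s-1000s" range in the table. The function names are illustrative only and do not come from the source.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Quoted in the scenario table: 50 million data samples per day per device.
SAMPLES_PER_DAY_PER_DEVICE = 50_000_000


def per_device_rate() -> float:
    """Average samples per second for a single device."""
    return SAMPLES_PER_DAY_PER_DEVICE / SECONDS_PER_DAY


def fleet_samples_per_day(devices: int) -> int:
    """Total samples per day for a fleet of the given size."""
    return SAMPLES_PER_DAY_PER_DEVICE * devices


print(round(per_device_rate(), 1))   # 578.7
for devices in (100, 1_000):
    # 100 devices -> 5_000_000_000, 1_000 devices -> 50_000_000_000
    print(devices, fleet_samples_per_day(devices))
```

Even at the lower end of the range the fleet produces billions of samples per day, which is why the table lists streaming, event-based processing alongside both online and offline storage.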

6.3.2 Flexible Tariffs for Demand Side Management
Application Scenario: Flexible Tariffs
Description: Demand can technically be flexibilised, meaning it can be adapted to the available power supply in order to increase efficiency in balancing demand and supply. To reward this flexibilisation accordingly, so-called variable or flexible tariffs have to be designed and implemented.
Example Use Case: Compared to today's energy usage tariffs, which require almost no data exchange with the end user, there will be an immense increase in the need for energy usage data. Today the required level of energy usage data is only managed for customers that are significant enough, i.e. commercial and industrial ones, such that the high costs of managing and maintaining more personalized power usage tariffs can be justified by the returns for suppliers. Big data technology could reduce the costs of managing and analysing data, enabling efficient design and implementation that would not only reduce suppliers' costs of maintaining the existing variable tariff offers but actually allow them to offer these tariffs to their entire customer base on a mass scale.
End users would employ technology that helps them become more flexible in their daily energy consumption without noticing a difference in comfort. Energy suppliers who have a more controllable load portfolio can use it to balance out forecast errors and wholesale purchase violations, e.g. too much or too little power for a transient time period. The suppliers can pass some of the gains on to customers in the form of tariff reductions or other incentive models. In general, the more efficiently the balance between supply and demand of power is realized, the better it is for the entire economy and the environment.
A multi-party building in a dense city area has 35 parties with night storage heating in their homes. A traditional central heating system would require the installation of pipelines, which is economically prohibitive.
However, over the long run, running these night storage heaters with electricity that is generated far away and transmitted and distributed with considerable energy efficiency losses is also prohibitive. The property management decides to install a combined heat and power station that receives gas from the city and generates heat to feed back into the municipal heating distribution network. The waste product of the combined heat and power station is power, which can be used for the electrical heating. The storage capability of what was traditionally coined night storage heating is used to buffer the local variations in heating demand and hence excess power. If the local storage is not sufficient or the feed-in prices are better, the excess power is sold to the utility. The flexibility demonstrated here can only be billed through real-time metering of gas usage, power usage, power feed-in and heating feed-in. And to offer such variable tariffs and make use of the flexibility on the purchasing side, suppliers manage mass data and use advanced analytics for both the short and the long term.
User Value: User Impact of Application Scenario: high, since this scenario not only stops the wasteful end use of energy but actually realizes the value of energy savings on the wholesale level.

Maturity of Application Scenario: implemented manually for a select few significant customers such as industrial facilities (functionality available but without BIG Data); on the facility side, too, flexibility is not used today. There are a lot of pilot implementations already, but the regulatory aspects of variable tariffs are unclear in Europe. There are very few service providers for demand side management in Europe. Financial Impact: TBA
Prerequisites: Alignment with existing IT capabilities; Data Digitalization through Smart Meters; Data Quality, Validation and Plausibilisation; Advanced Data Analytics, Anomaly Detection; Data Security and Privacy Requirements; System Incentives
Data Sources: Energy Usage Data, Multi-utility, i.e. Gas, Water, Heating, Power; Electricity Market Data, Prices; Geospatial Data; Environmental Data, such as Weather Data
Application Domain: Energy Efficiency; Efficiency of Energy markets, Portfolio Management; Municipal Multi-utility Network Management
Type of Analytics: Basic Analytics (e.g. data cleansing, aggregation, reporting); Mature Analytics (e.g. data mining, data validation and plausibilisation, billing); Advanced Analytics (e.g. prediction, portfolio optimization, tariff design)
Required Big Data Technology: For example, considering only the Smart Meters from all possible other data sources such as home automation gateways, sensors and intelligent loads such as smart thermostats: data samples/day/device; millions of devices. Streaming; event-based. Data acquisition: privacy/confidentiality; multiple service providers need energy usage/feed-in data. Data analysis: bad data, missing data, aggregation; anomaly detection. Data curation: validation and plausibilisation of usage data. Data storage: replication at each party vs. data marketplaces. Data usage: cross-organizational synergies vs. privacy/confidentiality
Sources: Interviews; E-DeMa
Big Data Requirements
The Big Data requirements of the Energy Sector are manifold.
For an initial synthesis from the current background research, some interviews and expert opinions, we concentrate on the obvious sources of big data, namely digital measuring units. But there are also unstructured data sources that have recently gained more interest from a customer engagement perspective. Along the entire workflow of acquisition, analysis, storage, curation and usage of these data sources, which are obviously new to handle, we extract some of the related requirements. Primarily

this section will provide the discussion basis for the road-mapping activities together with the technical working groups.
Data Acquisition
Data acquisition from field devices such as smart meters, measurement units and fault recorders is difficult, mainly because loss of data is the norm rather than the exception. Research is needed into technical possibilities for reducing the loss of data at the communication network level. But there could also be additional possibilities at the application level, such as inherent redundancy, which at first even doubles or triples the already huge amounts of data. Either data acquisition must become more reliable or there needs to be advanced analytics that can reduce the impact of lost data. For the billing of new variable tariffs, for example, the data needs to be validated and windows of lost data accountably filled with derived data. At the same time, if it is energy usage data, privacy and confidentiality are a huge concern. Energy usage data of commercial and industrial buildings can be used to infer insights about the business, which raises confidentiality issues. Equally, end users can be profiled down to what type of coffee they are drinking or what TV channel they are currently watching. Privacy- and confidentiality-preserving data acquisition can be a unique selling point for all stakeholders.
Data Analysis
Utilities have been deploying intelligent electronic devices such as phasor measurement units and smart meters in recent years. It is conceivable that the data from this new breed of digital equipment will accumulate into a very rich well of information and insight, if it can be analyzed efficiently. Simply managing massive amounts of data has no value in and of itself. Smart meter data analytics can theoretically be the information delivery backbone of the distribution system.
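The Data Acquisition requirement that windows of lost data be "accountably filled with derived data" can be illustrated with a small Python sketch. It is only one possible reading: linear interpolation between the nearest valid reads is an assumption made here, not a method named in the source, and the function name `fill_gaps` and the sample values are hypothetical. Each returned value carries a flag so that estimated reads stay distinguishable from measured ones.

```python
from typing import List, Optional, Tuple


def fill_gaps(reads: List[Optional[float]]) -> List[Tuple[float, bool]]:
    """Return (value, estimated) pairs for a series of interval reads.

    Missing reads (None) are linearly interpolated between the nearest
    valid neighbours; gaps at either end reuse the nearest valid read.
    """
    known = [i for i, v in enumerate(reads) if v is not None]
    if not known:
        raise ValueError("no valid reads to derive estimates from")

    filled = []
    for i, value in enumerate(reads):
        if value is not None:
            filled.append((value, False))   # measured read, kept as is
            continue
        prev = max((k for k in known if k < i), default=None)
        nxt = min((k for k in known if k > i), default=None)
        if prev is None:
            estimate = reads[nxt]           # gap at the start
        elif nxt is None:
            estimate = reads[prev]          # gap at the end
        else:
            weight = (i - prev) / (nxt - prev)
            estimate = reads[prev] + weight * (reads[nxt] - reads[prev])
        filled.append((estimate, True))     # derived read, flagged
    return filled


# Hypothetical hourly reads in kWh with three lost values.
hourly_kwh = [0.5, None, None, 0.8, 0.6, None]
for value, estimated in fill_gaps(hourly_kwh):
    print(f"{value:.2f} kWh", "(estimated)" if estimated else "")
```

Keeping the flag next to each value is what makes the fill accountable: a bill or an audit can show which intervals were measured and which were derived.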
Analytics must be able to pinpoint emergencies in the system, such as congestion at feeder lines or heavy loading of distribution transformers, in addition to the advanced analytics needed to design and implement more personalized, flexible energy usage tariffs. A rather recent upgrade of eMeter's Meter Data Management platform includes a broad analytics suite covering outage detection, power load monitoring and meter asset health, along with end-user-related analytics (Business Spectator, 2012).

Data analytics in these scenarios also requires a new breed of utility workers who can tell a pointed story once all that data has been managed and analysed. The art is to define a few key performance indicators within the myriad of data such that expert decision making is supported. Defining these indicators requires a very deep understanding of both the utility business and the data sciences. Indeed, in a survey of utilities with smart meters, utilities reported a lack of new talent (Berst, 2012).

The energy sector also deals with unstructured data, e.g. text entered in free-text fields by field crews, or by customers in systems and feedback applications. This unstructured but potentially valuable customer data is hard to mine. About 80 per cent of how utilities communicate with their customers is said to be unstructured (Smith, 2012). Notes taken in call centres and a technician's notes in the field during maintenance are just a few examples of the need for sophisticated text mining, or for audio natural language processing in the case of customer call centre data.

Data Storage

Data storage is the one requirements field that is already being filled with big data technologies. In the US, for example, there are a few projects and pilots that use openPDC, which is built on top of the Hadoop framework (Smith, 2012). eMeter started utilizing cloud computing and offering meter data management as a service through a partnership with Verizon (Tweed, 2011). Other companies such as GE, Lockheed Martin, Honeywell and Itron are also investigating cloud-based smart grid services (John, 2011). The UK government has also initiated a move toward tapping into the economies of scale for big data storage, with strict privacy and security guidelines (Energy Efficiency News, 2012). So the use of big data technology for storage seems to have convinced the industry, at least in a few pioneering examples.

There is still the issue of streaming data writes, however, which for example cannot be resolved with the Hadoop framework. But do utilities really need multiple storage systems for online and offline data access? Research here would shed some light on the different data access requirements and the trade-offs involved, which do not seem to be clear in the industry.

The majority of the energy sector uses so-called historians, which are a kind of data warehouse for time-series analysis. The sector has only recently started to require more sophisticated dashboards and analytics capabilities on top of these historians. Analytics on mass data can take a considerable amount of time if it is not accelerated by techniques such as high performance computing, parallel computing, in-memory computing or distributed computing. Data storage must therefore enable efficient access to data, at least for fast analytics, but there are many applications with very divergent requirements on data access. More sophisticated solutions here will certainly see interest from the industry.

Data Curation

A recent example of the need for data curation is the immediate complaints of end users after smart-meter-enabled billing was piloted or rolled out.
Customers complain, for example, of being charged during outages, as well as of poor customer support in resolving these issues (John, 2012). This is a typical case in which the industry-standard process of so-called validation, estimation and editing simply fails when real-time measurements are supposed to be used. Here, data curation in the form of automated mechanisms such as anomaly detection and resolution could help. For example, if a range of interval data is missing or implausible, is this due to a power outage, meter failure, communication failure, a vacation, demand response, or something else? Additionally, data curation must be coupled with customer care, such that the call agent can immediately see where a complaint is rooted, what possible causes it could have, how far the system has progressed in identifying the cause of the implausibility, and what outcome is expected. The agent should even be able to change the outcome if the cause was something other than what the system automatically identified.

An interesting requirement also concerns how far social media can be integrated to crowd-source data curation. For example, can tweets about a power outage be used to validate the missing interval data and mark it as being due to this outage?

Data Usage

When it comes to data usage, the industry falls into three separate groups. The first group is sceptical that there is anything usable within all that data and therefore deems any digitalization a risky investment. The second group knows that something big will emerge, at least in terms of requirements, and knows that it has to make use of the data, because otherwise there will be no return on its investments in digitalization.
And then, of course, there is the small group of early adopters who take the newest paradigms from data-heavy sectors like the web and adapt them with success, but potentially pay a high price: because the utility lacks the technological experience to adopt these technologies on its own, a lot of its internal knowledge is at risk of moving to companies that may support the competition in the next project. Still, since data-driven businesses move exponentially fast, early adopters will probably gain a head start that the competition may never make up, unless another disruption levels the playing field.

Let us concentrate here first on the group that has requirements for data usage. Once the benefits become clear, the other groups will move into the same lanes. There may then be a tsunami of ideas on innovative places to look for new data that can be correlated with the data of established systems. There will still be a need, however, for business thinkers to identify a specific need to be met and to realize the value of big data in meeting that need.

Regarding the use of unstructured data, here are a couple of examples. Call centre data can be used to improve outage root cause analysis. A field crew member's notes during maintenance or post-disturbance procedures can be used to build models that help to understand the causes of asset failure and to identify these causes in future before they become prevalent (Smith, 2012). Another typical use of big data from the web is marketing, i.e. customer segmentation, product portfolio management, etc. Energy suppliers and retailers can of course adopt these same usage scenarios, since they have the same needs. Energy usage patterns could be understood better through a finer resolution of usage data.

Structured time-series data coming from smart meters or other measuring units can also be used for product portfolio management, customer segmentation, etc. With real-time measurement of energy usage, variable tariffs can be implemented, which increases customer choice. In a survey (King, 2011), 73% of customers requested these options. Demand side management programs, of which variable tariffs are only one use of energy usage data, could reduce peak demand by 20% (King, 2011).
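As a worked illustration of a variable tariff, the following sketch bills one day of hourly readings under a two-rate time-of-use tariff and then shifts 20% of the peak-hour consumption into off-peak hours, mirroring the peak reduction figure cited above. The peak window, the prices and the load profile are hypothetical values chosen for readability, and the function names are illustrative only.

```python
from typing import List

# Hypothetical tariff parameters; none of these values come from the report.
PEAK_HOURS = range(17, 21)   # 17:00 to 20:59
PEAK_PRICE = 0.40            # EUR per kWh during peak hours
OFF_PEAK_PRICE = 0.20        # EUR per kWh otherwise


def time_of_use_bill(hourly_kwh: List[float]) -> float:
    """Bill one day of hourly readings under a two-rate time-of-use tariff."""
    if len(hourly_kwh) != 24:
        raise ValueError("expected 24 hourly readings")
    return sum(
        kwh * (PEAK_PRICE if hour in PEAK_HOURS else OFF_PEAK_PRICE)
        for hour, kwh in enumerate(hourly_kwh)
    )


def shift_peak_load(hourly_kwh: List[float], fraction: float) -> List[float]:
    """Move `fraction` of every peak-hour reading into the off-peak hours, spread evenly."""
    shifted = list(hourly_kwh)
    moved = 0.0
    for hour in PEAK_HOURS:
        moved += shifted[hour] * fraction
        shifted[hour] *= 1.0 - fraction
    off_peak = [h for h in range(24) if h not in PEAK_HOURS]
    for hour in off_peak:
        shifted[hour] += moved / len(off_peak)
    return shifted


# Hypothetical household: 1 kWh per hour, 2 kWh per hour in the evening peak.
profile = [1.0] * 24
for hour in PEAK_HOURS:
    profile[hour] = 2.0

shifted_profile = shift_peak_load(profile, 0.20)
bill_before = time_of_use_bill(profile)
bill_after = time_of_use_bill(shifted_profile)
print(f"bill before shifting: {bill_before:.2f} EUR")
print(f"bill after shifting:  {bill_after:.2f} EUR")
print(f"peak-hour load: {profile[17]:.2f} kWh -> {shifted_profile[17]:.2f} kWh")
```

Total consumption is unchanged by the shift; the lower bill comes only from moving energy into cheaper hours, which is the incentive a variable tariff is meant to create.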
However, the capability to process and analyse vast amounts of data in a technically trustworthy way that does not invade the privacy of consumers is a prerequisite for sophisticated variable tariff design of this kind.

Energy usage data also has areas of use that are completely new to the energy sector. Research (King, 2012) shows that energy savings at end users' premises can be 5-10%. So do traditional energy suppliers need to rethink and sell energy efficiency products, or will new players use this data to save energy? Nonetheless, traditional energy suppliers can use their variable tariff portfolios to make savings on purchases in the wholesale market. In wholesale electricity markets, prices change every hour or more frequently; the European Energy Exchange also has a spot market. Without real-time energy usage data, retailers pay for energy based on hourly estimates of customer usage. These estimates may level out over the course of a year, but in the shorter term, even if customers lower their energy use, the energy retailer pays the higher price for peak energy, because the estimates do not change. With real-time energy usage data, retailers could realize real savings (Carrasco, 2011).

When power grid monitoring data is delivered at a finer resolution than the low-resolution measurements used in today's SCADA systems, utilities can make better use of their assets by taking the dynamics and other environmental data into account. Such situational awareness can prevent blackouts, but also enables better utilization of power grid assets and increases operational efficiency. Better knowledge about the system based on real measurements additionally helps system operators to plan and size their systems better, increasing the efficiency of investments. Grid investment savings through the use of real measurements instead of simulations based on models can total tens of millions of dollars for a large utility (SmartConnect Deployment, 2011).
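The wholesale purchasing argument made earlier in this section, that a retailer settled on static usage estimates keeps paying peak prices even when customers shift their consumption, can be made concrete with a small calculation. This is a sketch under stated assumptions: the prices, the estimated profile and the measured profile are hypothetical, and procurement_cost is an illustrative helper, not an actual market settlement rule.

```python
from typing import List


def procurement_cost(prices_eur_per_mwh: List[float], load_mwh: List[float]) -> float:
    """Cost of buying each hour's load at that hour's wholesale price."""
    if len(prices_eur_per_mwh) != len(load_mwh):
        raise ValueError("prices and load must cover the same hours")
    return sum(p * q for p, q in zip(prices_eur_per_mwh, load_mwh))


# Hypothetical four-hour window: two off-peak hours followed by two peak hours.
prices = [30.0, 30.0, 90.0, 90.0]            # EUR per MWh, illustrative only
estimated_load = [10.0, 10.0, 20.0, 20.0]    # static profile used when no measurements exist
measured_load = [12.0, 12.0, 18.0, 18.0]     # same total energy, 2 MWh moved out of each peak hour

cost_on_estimates = procurement_cost(prices, estimated_load)
cost_on_measurements = procurement_cost(prices, measured_load)
saving = cost_on_estimates - cost_on_measurements

print(f"settled on estimates:    {cost_on_estimates:.2f} EUR")
print(f"settled on measurements: {cost_on_measurements:.2f} EUR")
print(f"saving:                  {saving:.2f} EUR")
```

Both profiles contain the same total energy, so the difference in cost arises purely from when the energy is accounted for, which is what real-time usage data would make visible to the retailer.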
Renewable energy integration is another new area in the energy sector that demands real-time management and analytics of vast amounts of data from a variety of sources. Wind and solar often generate too much or too little power, which the sector calls the intermittency of renewable energy sources. So far, the industry's coping strategy is to cut these sources off when there is not enough demand to hold the balance, and hence to forgo free power. There are many small pilots in Europe that explore how demand response and variable tariffs could be beneficial. Hawaii demonstrates a large-scale pilot of renewable energy integration that uses more, and more varied, data sources to offset the natural fluctuations in renewables (Shahan, 2011).

6.5. Conclusion and Next Steps

There is a great deal of data in terms of volume, velocity and variety, especially in the smart grid domain of the energy sector, that can potentially be used and analysed to enable new business models. But there are serious hurdles to overcome, both regulatory and related to best practice. Regarding the big data mindset, the entire spectrum is present in the energy sector, ranging from Internet pioneers who see a next Internet that is even bigger, to sceptics who doubt whether all that data can be better than the industry's fine-tuned models and practices.

In this first round we have looked at the side of the energy sector known as the smart grid, which currently sees a lot of movement towards more data, with all the accompanying opportunities and challenges. Certainly there are other areas where big data technologies can be, or are being, made use of. The next steps involve extending the interviews to a wider stakeholder group, so as to gather opinions from a variety of markets within the energy sector and not just the smart grid market, although with regard to big data the smart grid is certainly the experimental lab of the energy industry.

What is dangerous, though, is that there is also a lot of confusion as to which of the old and new technological offerings are big data technologies. As a result, some pilots are never continued because they do not scale beyond the pilot phase, which hinders big data business from growing to its full potential. Industry workshops with the stakeholders should aim to clear up such misunderstandings. Another lever that should be utilized is the ability of regulation to kick-start the ecosystems that are needed to realize business value from mass data. During the next round of interviews and stakeholder engagement activities, one key aspect will be finding the answer to the question of how to foster such ecosystems around energy data.

References

Berst, J. (2011, July 28). Smart meters: California PUC issues sweeping data access orders. Smart Grid News. Retrieved from PUC-issues-sweeping-data-access-orders-3869.html#.UXBouMrk9R5
Berst, J. (2012, January 16). Smart Grid Companies: eMeter jumps into analytics fray. Smart Grid News. Retrieved from
Berst, J. (2012, July 10). Smart grid Big Data: Survey confirms utility opportunities, but spots blunders as well. Smart Grid News. Retrieved from Survey-confirms-utility-opportunities-but-spots-blunders-as-well-4949.html
Business Spectator. (2012, December 5). Big Data Survey - Energy: Energy sector's greatest asset what they already know. [Special Issue]. Retrieved from file_file/big Data Survey - Energy.pdf
Carrasco, A. (2011, September 14). Great Britain: Half-hourly meter data can yield cascading benefits. Retrieved from
CrunchBase. (2013). Companies: Space-Time Insight. [Directory]. Retrieved from
Dupre, R. (2012, March 9). Big Data Has Arrived in the Energy Sector. Rigzone. Retrieved from
Entso-e. (2013). Statistical Database. Retrieved from
eMeter. (2011, February 11). Verizon Teams with eMeter to Enable Meter Data Management from the Cloud. [Press Release]. Retrieved from press-releases/verizon-teams-with-emeter-to-enable-meter-data-management-from-the-cloud/
Energy Efficiency News. (2012, November 8). UK government outline next steps in smart meter rollout. Retrieved from
EnergyHub. (2012, January 19). Does Living in a Colder Climate Make You Warmer on the Inside? Retrieved from
Fehrenbacher, K. (2013, January 3). 13 energy data startups to watch in GigaOM. Retrieved from
Fehrenbacher, K. (2013, February 7). Tom Siebel's $100M big data energy startup C3 finally emerges as a player. GigaOM. Retrieved from
Fehrenbacher, K. (2013, January 29). Using a tweet to get the power back on faster. GigaOM. Retrieved from
GE. (2013, January 29). The Grid of 2054, Today: Grid IQ Insight Uses Futuristic Tech to Detect Today's Outages. GE Reports. Retrieved from
Geschickter, C. (2011, November). How Itron Plans to Capture the Smart Grid Big Data Opportunity. Greentech Media. Retrieved from
Harris, D. (2011, July 19). How California ISO uses Google Maps, big data to manage power. GigaOM. Retrieved from
Harris, D. (2012, September 19). Space-Time Insight raises $14M to put your data on a map. GigaOM. Retrieved from
Hedman, K. W., Ferris, M. C., O'Neill, R. P., Fisher, E. B., & Oren, S. S. (2010, May). Co-Optimization of Generation Unit Commitment and Transmission Switching with N-1 Reliability. IEEE Transactions on Power Systems, Vol. 25, No. 2. Retrieved from
Information and Privacy Commissioner Canada. (2011, February 2). Operationalizing Privacy by Design: The Ontario Smart Grid Case Study. Retrieved from
John, J. (2011, November 14). Smart Grid Revolution? GE Launches Smart Grid as a Service. Greentech Media. Retrieved from Revolution-GE-Launches-Smart-Grid-as-a-Service
John, J. (2011, December 9). Siemens, Competitors Snapping up Smart Grid Software. Greentech Media. Retrieved from
John, J. (2012, February 23). Smart Grid 2.0 Means Real-Time Pricing, Data Analytics and More. Greentech Media. Retrieved from
John, J. (2012, October 31). Big Data, Smart Meters and the (Near) Real-Time Grid. Greentech Media. Retrieved from
King, C. (2011, April 8). Dynamic electricity pricing for consumers: 5 myths and truths. Retrieved from myths-and-truths/
King, C. (2011, October 6). Most U.S. consumers want time-of-use electricity pricing. Retrieved from
King, C. (2012, November 12). Proven recipes for successful consumer engagement. Retrieved from
LaMonica, M. (2011, December 5). Siemens buys eMeter's smart grid 'big data' software. C-Net. Retrieved from
Maitland, J. (2012, November 9). GE readies big data analytics platform, targets utilities. GigaOM. Retrieved from
Marcacci, S. (2012, April 2). How to Stop Smart Meter Opt-Out Mandates from Advancing Across America? Clean Technica. Retrieved from
McKinsey on Smart Grid. (2010). Can the smart grid live up to its expectations? Latest thinking Number 1, Summer. Retrieved from g/mckinsey_on_smart_grid
Navigant Research. (2012, September 17). Smart Grid Data Analytics: Smart Meter, Grid Operations, Asset Management, and Renewable Energy Integration Data Analytics: Global Market Analysis and Forecasts. Retrieved from
National Physical Laboratory. (2013, April 17). On-site measurement improves Smart Grid design. [Press Release]. Retrieved from
SAP. (2013). Smart Meter Analytics: Turn smart meter data into powerful insights and actions. [Product Web Site]. Retrieved from
Sentec. (2011). The European Market for Smart Electricity Meters. [Infographics]. Retrieved from
Sentec. (2012, March 7). Smart meter rollout: two speed Europe as Germany lags behind its neighbors. Retrieved from
Service Insiders. (2011, November 25). Studie Smart Metering in Europa bis 2020: Marktentwicklung und Potenziale. Retrieved from insiders.de/news/show/1122/studie-smart-metering-in-europa-bis-2020-marktentwicklung-und-Potenziale
Shahan, Z. (2011, November 16). World-Leading Smart Grid Demo on Maui Island, Hawaii. Clean Technica. Retrieved from
Smith, M. (2012). The challenge and promise of unstructured data Utilities navigating a more complex world of data. Intelligent Utility Magazine July / August. Retrieved from
Southern California Edison. (2007, July 31). Edison SmartConnect Deployment Funding and Cost Recovery. Retrieved from
Stawinska, A. (2009). Energy sector in Europe. Eurostat Statistics in Focus 72/2009. Retrieved from
Sumic, Z. (2013, January 9). Magic Quadrant for Meter Data Management Products. Gartner Industry Research G. Retrieved from
Sridharan, M. (2013, February 21). Big data opportunities for outsourcing companies in the European utilities sector. Outsource Magazine. Retrieved from
Tait, W. (2011, March 14). Remedial Action Schemes (RAS). [Presentation]. Retrieved from Splash Pages/Attachments/24/Concurrent Lectures Part 2.pdf
Texas A&M Engineering Experiment Station. (2012, August 31). Robust Adaptive Topology Control (RATC): Improving Grid Resiliency under Massive Integration of Non-dispatchable Intermittent Renewable Generation, Cascading Faults, and Malicious Attacks. Retrieved from Vol.pdf
Tweed, K. (2011, February 2). Mixed Greens: SmartSynch and Qualcomm Open an Apps Storefront, eMeter Takes MDM to the Cloud. Greentech Media. Retrieved from
Vespi, C., & Hazen, J. (2012, July 10). Big Data, Bigger Opportunities: Plans and Preparedness for the Data Deluge. Oracle Utility Transformations. Retrieved from pdf

Annex a. Big Data questionnaire for public sector
