1 Connecting Europe for New Horizons, Berlin, 20 September 2013. Language Technology for Big Data Analytics. Wolfgang Wahlster. Saarbrücken, Kaiserslautern, Bremen, Berlin, Osnabrück
2 Big Data: Data as Tradeable Assets
Sensor Data for Weather, Climate, Smart City and Smart Home
3D Internet Data and Media Streams
Production and Machine Data from Industry 4.0
Mass Data from Individualized Medicine, from Genome Analysis and Imaging Methods
Mass Data from Smart Grid and Smart Metering
Financial Messaging Data; Supervision of Banks, Stock Exchanges and High-Speed Trading
Life Log Data for Individuals and Objects; Digital Product Memories
Mass Data from Mobility by Car2Car Communication and Logs from Single Vehicles
Mass Data from Social Networks
95% of the 1.2 zettabytes of digital data worldwide are unstructured, and the data volume grows by 62% per year.
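To make the growth figure on this slide concrete, the following minimal sketch compounds the quoted 1.2 zettabytes at 62% per year. It assumes the rate stays constant, which the slide does not claim; the function names `project_volume` and `doubling_time` are illustrative only.

```python
import math


def project_volume(start_zb: float, annual_growth: float, years: int) -> float:
    """Volume after `years` of compound growth at a constant annual rate (assumption)."""
    return start_zb * (1.0 + annual_growth) ** years


def doubling_time(annual_growth: float) -> float:
    """Years needed for the volume to double at a constant annual growth rate."""
    return math.log(2.0) / math.log(1.0 + annual_growth)


# Figures taken from the slide: 1.2 ZB baseline, 62% growth per year.
for year in range(6):
    print(year, round(project_volume(1.2, 0.62, year), 2))

print("doubling time in years:", round(doubling_time(0.62), 2))
```

Under this assumption the volume doubles roughly every year and a half, which is why the share of unstructured data matters more each year.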
3 Outline of the Talk
1. Human- and Machine-generated Big Data
2. Major German and European Big Data Initiatives
3. The Role of Language Technology for Big Data Analytics
4. The Need for Real-Time NL Analysis and Generation for Application Impact
5. Conclusion
4 Development of Global Data Volumes by 2020 (in Zettabytes). Source: A.T. Kearney 2013. Mainly machine-generated data, but also encoded in natural language for human inspection; multilingual natural language generation becomes more and more important.
5 Human-Generated Internet Content: Zettabytes of Unstructured BIG Data
Exponential growth of Internet data, every minute: new tweets; 48 hours of video uploaded; more than 2 million queries; 3125 new photos uploaded; 571 new homepages; new apps, messages and e-mails sent
Commercial spoken language access: Siri (Apple), Google Now (Google), S Voice (Samsung), Cortana (Microsoft)
Still missing: crosslingual information retrieval
6 From DB to BD: New BIG DATA Services. New Services based on Cloud Nets: Decision Support, Prediction, Simulation, Knowledge Discovery, Information Trading, Fusion, Optimization, Modeling. Machine Learning, Multimodal Interaction, Information Extraction from Text and Video. Language Technology as a Key Enabler for BIG DATA Analytics. Lower degree of structure: from structured to unstructured. BIG DATA: Exa to Zetta (10^21), new data material, low information density, less used for ICT. Databases, Data Warehouses, WWW: Giga to Peta (10^15), classical data material, high information density, much used for ICT. Complexity
7 The Era of BIG DATA: Increasing Importance of Real-Time Natural Language Processing
Digital Data (~1960): digital data collection, first databases
Data Warehousing (~1980): data cubes, relational databases, financial data, statistics
Data Mining (~1990): artificial intelligence, machine learning, knowledge discovery
Big Data Analytics (today): unstructured data, stream processing, Smart Data Engineering, collective intelligence, massively distributed analytics, NoSQL databases, heterogeneous data and knowledge, petabytes and zettabytes of data
Application focus over time: Documentation, Enterprise Management, Process Optimization, Real-time Decision Support and Control
Adapted from Siemens
8 Focus of Data Analytics Is Changing from Description of Past to Decision Support of Today
Axis labels: Value & Complexity; Inform, Analyze, Act
Descriptive Analytics: What happened? Adopted by vast majority but not all data (99%)
Diagnostic Analytics: Why did it happen? Adopted by minorities (30%)
Predictive Analytics: What will happen? Still few adopters (13%)
Prescriptive Analytics: What shall we do? Very few early adopters (3%)
Percentages: current penetration across all industries (according to Gartner 2013). Adapted from Siemens
9 The Processing Cycle for BIG DATA
Language Technology
Search for BIG DATA Sources: Linked Data Potential, Data Fusion Potential, Sensor Selection and Sensor Positioning, Search for Open Data Sources and Streams
BIG DATA Collection: Unstructured Data, Multimodal Data, Uncertain Data, Complex Events, Sensor Data, Data Streams, Deep Web Data
BIG DATA Analysis: Data Cleansing, Information Extraction, Semantic Analysis, Sentiment Analysis, Data Correlation, Pattern Recognition, Real-Time Analytics, Machine Learning
BIG DATA Exploitation: Decision Support in Real Time, Prediction, Simulation, Exploration, Modeling, Monitoring, Alerting, Reporting, Controlling
Smart Data Engineering
BIG DATA Maintenance: Backtracing Data Origins, Data Enrichment, Annotation and Tagging, Data Validation, Redundancy Avoidance, Consistency Checking, Inference & Abstraction
Data Storage & Retrieval: In-Memory Technologies (HANA, TERRACOTTA), Column Databases, NoSQL, Cloud Storage
Densification Technology: Aggregation Procedures, Compression Techniques
10 Preparation of Second ICT Future Project for German Chancellor ( M funding, Start 2014): for Smart Services : Internet-based Services for Economy
Roles: Business Designer, Clerk, Knowledge Worker, Application Developer, System Administrator
Cloud Infrastructure:
BPaaS: Business Process as a Service
SaaS: Software as a Service
PaaS: Platform as a Service
IaaS: Infrastructure as a Service
KaaS: Knowledge as a Service
BDaaS: Big Data as a Service
Increased Value Creation along the Virtualization Chain: 1 Hardware, 2 Software, 3 Information & Knowledge (Big Data)
Source: DFKI
11 German BIG DATA Conferences
First National Conference, 3 June 2013, Berlin: BIG DATA: Exploring Data Treasures in Science and Industry
BIG DATA Summit, 24 June 2013, Bonn
Second National Conference, November 2013, Berlin: BIG DATA: Potentials for Germany
10 December 2013, Hamburg: BIG DATA Platform CEO Forum
12 German BIG DATA Funding Programs 2-3 National BIG DATA Competence Centers Submission Deadline: 12 July Years Perspective Application Oriented Basic Research Finalist Selection: 12 Sept. 2013, Senior Expert Jury: November 2013 Announcement: IT Summit, 10 Dec 2013 Consortia Projects between Industry and Academia Submission Deadline: 12 July 2013 Applied Research Consortia Projects between Industry and Academia Program Announcement: November 2013 Emphasis on SME Involvement
13 BIG coordination and support action EU consortium (11 partners, including ATOS) The goal is to help EC define a roadmap for Big Data Technologies for Horizon 2020! Strategic Director: Wolfgang Wahlster, DFKI
14 Key facts about BIG
Type of project: CSA
Project start date: September 2012
Duration: 26 months
Call: FP7-ICT
Effort: 552.5 PM
Budget: 3.038 M EUR
Max EC contribution: 2.499 M EUR
Consortium: 11 partners
15 Major Activities of the BIG Forum
Identification of sector requisites: requirements and objectives from all sectors; industry-driven sector forums
Applicability of Big Data technology in each sector: Big Data technologies and their capabilities; Technical Working Groups; Technical White Papers
Elaboration of sector roadmaps: one roadmap per sector; contributions towards an integrated, cross-sectorial roadmap
16 Social Media Analysis: BIG DATA NLP and Sentiment Analysis by the DFKI Spin-Off The Net elects Analyzing Twitter and Facebook feeds for German Federal elections 2013, 22 September Analyze for candidates, parties, and top tweets Online feature of German weekly magazine Wirtschafts-Woche, powered by Attensity
17 Social Media Analysis: BIG DATA NLP and Sentiment Analysis by the DFKI Spin-Off
Selection of 10,000 tweets from 400 million every day, plus 17,000 Facebook entries per week
Advanced NLP: negation analysis, analysis of counterfactuals and comparatives
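As a rough illustration of what negation analysis adds to sentiment scoring, the sketch below flips the polarity of lexicon words that follow a negator. It is a toy example under stated assumptions: the word lists, the two-token negation scope and the function name `sentiment` are all hypothetical and do not describe the spin-off's actual system.

```python
# Hypothetical mini-lexicons; a real system would use far richer resources.
POSITIVE = {"good", "great", "strong", "convincing"}
NEGATIVE = {"bad", "weak", "poor", "disappointing"}
NEGATORS = {"not", "no", "never"}


def sentiment(text: str, scope: int = 2) -> int:
    """Sum word polarities, flipping those within `scope` tokens after a negator."""
    for mark in ",.!?":
        text = text.replace(mark, " ")
    score = 0
    flip_left = 0  # tokens still inside the current negation scope
    for tok in text.lower().split():
        if tok in NEGATORS:
            flip_left = scope
            continue
        polarity = int(tok in POSITIVE) - int(tok in NEGATIVE)
        if flip_left > 0:
            polarity = -polarity
            flip_left -= 1
        score += polarity
    return score


print(sentiment("The speech was good"))      # 1
print(sentiment("The speech was not good"))  # -1
print(sentiment("Not a bad debate!"))        # 1
```

A plain bag-of-words count would score the second and third sentences with the wrong sign, which is the failure that negation handling is meant to avoid. Counterfactuals and comparatives need deeper analysis than this fixed-window heuristic.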
18 Example of Technology Transfer via Brains Massimo Romanelli
19 Multilingual Tweet Analysis for Disaster Management
11 March 2011, Fukushima
Users broadcast their experience immediately: the number of tweets rises right after the earthquake and tsunami. 1.5 million Twitter messages were analyzed by Collier and his group (cf. An analysis of Twitter messages in the 2011 Tohoku Earthquake: Son Doan, Bao-Khanh Ho Vo, and Nigel Collier, National Institute of Informatics).
The first Japanese tweets on the earthquake are as follows:
T05:48:08 "地震!" [Earthquake!]
T05:48:08 "地震だ縦揺れ!" [Earthquake ~ vertical shake!]
T05:48:14 "地震!!!!" [Earthquake!!!!]
First two English tweets, sent from an iPhone:
T05:48:54 Huge earthquake in TK we are affected!
T05:49:01 BIG EARTHQUAKE!!!
T05:50:00 Massive quake in Tokyo
The first tweet about a tsunami was an eyewitness tweet 6 minutes after the earthquake occurred at its epicentre:
T05:52:23 "オレ津波の様子見てくるわ!!!!" [I can see tsunami is coming!!!!]
The first concerns about nuclear plants right after the earthquake:
T09:50:49 "福島原発ヤバい状況らしい" [The Fukushima plant is in a really bad situation.]
Challenge: post-hoc analysis must be turned into real-time analysis.
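The step from post-hoc to real-time analysis can be sketched as a sliding-window keyword detector that raises an alert as soon as enough matching tweets arrive close together. This is only an illustration: the keyword list, the 60-second window, the threshold of three hits and the names `first_alert` and `at` are assumptions, not part of the cited study. The sample data are the tweets quoted on this slide.

```python
from collections import deque
from datetime import datetime, timedelta

# Illustrative multilingual keyword list (assumption, not from the study).
KEYWORDS = ("地震", "earthquake", "quake")


def first_alert(stream, window=timedelta(seconds=60), threshold=3):
    """Return the time at which `threshold` keyword hits fall within `window`.

    `stream` yields (timestamp, text) pairs in chronological order.
    Returns None if the threshold is never reached.
    """
    hits = deque()
    for ts, text in stream:
        if any(k in text.lower() for k in KEYWORDS):
            hits.append(ts)
            # Drop hits that have fallen out of the time window.
            while ts - hits[0] > window:
                hits.popleft()
            if len(hits) >= threshold:
                return ts
    return None


def at(hms: str) -> datetime:
    """Timestamp on 11 March 2011, the date given on the slide."""
    return datetime.strptime("2011-03-11 " + hms, "%Y-%m-%d %H:%M:%S")


tweets = [
    (at("05:48:08"), "地震!"),
    (at("05:48:08"), "地震だ縦揺れ!"),
    (at("05:48:14"), "地震!!!!"),
    (at("05:48:54"), "Huge earthquake in TK we are affected!"),
    (at("05:49:01"), "BIG EARTHQUAKE!!!"),
]

alert = first_alert(tweets)
print("alert at", alert, "after", (alert - tweets[0][0]).total_seconds(), "s")
```

With these settings the alert fires on the third Japanese tweet, six seconds after the first one. A production system would also need language identification, deduplication and a baseline for normal keyword frequency.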
20 BioCaster: Early Alerting for Public Health Events
Detects seasonal influenza and hay fever.
Features: event database search, trend graphs, ontology browsing, /GeoRSS alerting, up-to-date news in multiple languages, event maps, event alerts
WHO and GHSAG partners: IT, JP, CA, US, UK, FR, DE
born.nii.ac.jp: real-time Twitter analysis
21 Alignment of European and German National Projects Dealing with BIG DATA PPPs BIG DATA PPP Forum Chair of Advisory Committee Strategic Director
22 European and German Software Platforms for BIG DATA Processing
Generic Enabler for BIG DATA: batch and online stream processing of BIG DATA
BigMemory MAX: real-time access to 100s of TBs
SAP HANA: BIG DATA platform with up to 250 TB in in-memory databases
Open-source cluster/cloud computing framework for BIG DATA analytics
23 BIG DATA Generic Enabler of the FI-PPP
The key technical concept of the FI-PPP is the provisioning of Generic Service Platforms. These are supported by reusable, standardized and commonly shared key technologies and components, termed Generic Enablers, which can be applied across a multiplicity of Smart Application usage domains in multiple sectors.
Figure: a Future Internet Smart Application is assembled from Generic Enablers in the FI-WARE Catalogue (Generic Enabler 1 to N, including the Complex Event, Cloud and BIG DATA Generic Enablers) and runs on a FI-WARE Instance.
BIG DATA Generic Enabler:
Streaming and batch processing functionalities, both in one single platform.
Automatic deployment capabilities in a cloud-based cluster of nodes.
Wide range of available data injectors.
High-speed access to the resulting insights via a NoSQL database.
24 The Vision and Mission of FI-PPP
FI-WARE Catalogue, FI-WARE Open Innovation Lab, FI-WARE Shared Trial Facility, Specific Use Case Trial Facility
The FI-PPP promotes and enables large-scale experimentation and validation of the platforms in real-life application contexts involving a range of actors across domains, including large companies, SMEs, the research community as well as public administrations and citizens. The open platform approach further creates novel opportunities for entrepreneurship, new businesses and innovative value creation models based on cross-sector industrial partnerships.
25 BIG DATA Analytics for Financial Fraud Detection and Prevention
Use of the Terracotta Platform of DFKI Shareholder Software AG
Mitigated 100s of millions in fraudulent credit card transactions
Reduced fraud detection processing time from 45 minutes to less than 4 seconds
NLP Analysis of Sales Items
99.999% completed transactions with fraud detection rules checked
Reduced fraud processing time from 800 ms to less than 20 ms
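The two latency claims on this slide translate into speedup factors with simple arithmetic. The sketch below uses only the figures quoted above; because the slide says "less than" for the new timings, the results are lower bounds. The helper name `speedup` is illustrative.

```python
def speedup(before: float, after: float) -> float:
    """Ratio of old to new processing time (both in the same unit)."""
    return before / after


# 45 minutes down to (less than) 4 seconds, both expressed in seconds.
detection_speedup = speedup(45 * 60, 4)

# 800 ms down to (less than) 20 ms, both expressed in milliseconds.
per_transaction_speedup = speedup(800, 20)

print("fraud detection: at least", detection_speedup, "times faster")
print("per transaction: at least", per_transaction_speedup, "times faster")
```

These factors come to at least 675 and 40 respectively, which shows how in-memory processing moves fraud checks from a batch job into the transaction path.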
26 New Business Models of DHL Exploiting BIG DATA Collected by their Employees during the Delivery of Parcels
GEOVISTA: BIG DATA tool for estimating earnings opportunities and analyzing business potential. Dr. Helbig, CIO of DHL
Prepare a realistic sales forecast.
Evaluate a desired location by using high-quality geodata and NLP CRM reports provided by the subsidiary Deutsche Post Direct.
Local competitors are analyzed with the aid of up-to-date data provided by bedirect.
Business-location factors and the area being studied are visualized on a digital map.
27 BIG DATA Analytics for Intelligent Urban Management Trento EIT ICT Labs DFKI GmbH
28 DFKI Is a Founding Core Partner of EIT ICT Labs and Has a Strong Collaboration with the Trento Node Trentino Open Living Data (TOLD) The action line leader for Intelligent Mobility and Transportation systems is Dr. Christian Müller from DFKI
29 Living Big Data: The Trentino Territory as a BIG DATA Lab Transportation Energy Telecommunication
30 EIT ICT Labs Business Framework for BIG DATA at Trento Rise
31 DFKI's Social Media BIG DATA Analytics App for Crowd Management
London Lord Mayor's Show, Olympics 2012
Coronation of the new Dutch King, 30 April 2013
32 Conclusions
1. Big data technologies are an innovation motor for industry, science and government. Real-time multilingual natural language analysis and generation as well as translation technologies are key enablers for big data analytics.
2. Key research challenges are smart tools for real-time analytics and decision support based on intelligent information extraction from unstructured data.
3. Europe's, and in particular Germany's, strengths in big data technology are commercial in-memory computing platforms like HANA or Terracotta, open-source platforms like FI-WARE GEs and Stratosphere, and multilingual language technologies.
4. In Germany, the focus is on big data applications for Industry 4.0, smart grids, advanced mobility and personalized medicine.
33 Thank you very much for your attention. Design by R.O.