Big Data. On Distributed Systems to a Distributed Artificial Intelligence. Food for Financials, Maastricht, The Netherlands, June 10th, 2014

Big Data: On Distributed Systems to a Distributed Artificial Intelligence (Extended Version incl. Some Banking and Financial Market Aspects)
Food for Financials, Maastricht, The Netherlands, June 10th, 2014
Univ.-Prof. Dr. rer. nat. Sabina Jeschke, IMA/ZLW & IfU, Faculty of Mechanical Engineering, RWTH Aachen University
www.ima-zlw-ifu.rwth-aachen.de

Outline
I. A Bit of a Jump into the Deep End
II. In Search of a Definition: From Google Search and the 3Vs of Big Data, Finally Ending up with Distributed Systems and Artificial Intelligence
III. Expansion: Big Facts and High Figures, and Applications in the Banking and Financial Markets
IV. Methods and Concepts behind the Trend: About Challenges, their Solutions and Open Issues, and the Players in the Game
V. A Look Ahead: From Big Data to Cyber Physical Systems, the Internet of Things and Industry 4.0
VI. Summary

A bit of a jump into the deep end. Google Trends: Basics
Google Trends: understanding interests. A search term is analyzed relative to the total search volume, across various regions of the world and in various languages.
http://www.google.de/trends/explore

A bit of a jump into the deep end. Google Flu: Predicting the future (predicting the spread of diseases)
"It all started with the flu" [Google Correlate 2011]
The actual flu trend can be identified 7-10 days earlier by Google Flu Trends than by the official data of the Centers for Disease Control and Prevention (CDC) [Helft 2008].

A bit of a jump into the deep end. Pandemics: Exploring new patterns of complex scenarios
A circle model to foresee and to analyze pandemics [Brockmann and Helbing 2013]
"Computational work conducted at Northwestern University has led to a new mathematical theory for understanding the global spread of epidemics." [ScienceDaily 2013]
"The spreading takes place on the worldwide air transportation network of more than 4000 airports and 25000 direct links." [Brockmann/Helbing 2013]
Is the spread of infectious diseases complex, or does it just look complex? [Erickson 2013]
Using data on flights, trains, etc., the cities are rearranged: the distances between places and countries are adjusted depending on the flight connections. The result is simple: a circular wave, like the one a stone produces in water.

A bit of a jump into the deep end. Predicting human behavior: Election forecast
How Nate Silver won the election with Data Science [Smith 2012]
- Many data sources
- Using the past
- Consistent models
- Understanding limitations
The man behind the forecast: Nate Silver (born January 13, 1978)
- 2008 Presidential Election (49 out of 50 states correct)
- 2013 Academy Awards (3 of 4 winners correct)
- 2012 Presidential Election (50 out of 50 states correct)

A bit of a jump into the deep end. IBM's Watson in action
Challenge: building a computer system that could compete at the human champion level in real time in the American TV quiz show Jeopardy! [Ferrucci et al. 2010]
Watson is an artificial intelligence capable of answering questions in natural language.
What is Watson? Represented by IBM's Smarter Planet logo, Watson is ten racks of ten Power 750 servers. Watson's life began five years before the show as a Grand Challenge for IBM (like Deep Blue and Blue Gene before).
"I, for one, welcome our new computer overlords." Ken Jennings' response to losing an exhibition Jeopardy match to Watson

A bit of a jump into the deep end. Change of society: Google replacing grandparents?
"Grandparent used to be synonymous with a spring of knowledge called upon to pass down treasures of information to new generations." [Emling 2013]
Grandparents' knowledge: informal knowledge, everyday life experience, incl. common sense
"The survey of 1,500 grandparents found that children are increasingly using the internet to answer simple questions." [Telegraph 2013]
Towards the next steps in artificial intelligence: Google, from an expert system to a machine with common sense?
Google Trends [Biermann 2013], Trending "How To...", 2013, United States:
1. How to Tie a Tie
2. How to File
3. How to Get a Passport
4. How to Blog
5. How to Knit
6. How to Kiss
7. How to Flirt
8. How to Whistle
9. How to Unjailbreak
10. How to Vader

In search of a definition. Let's ask Google
"Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization."
"Big Data refers to technologies and initiatives that involve data that is too diverse, fast-changing or massive for conventional technologies, skills and infrastructure to address efficiently. Said differently, the volume, velocity or variety of data is too great. But today, new technologies make it possible to realize value from Big Data."
"Every day, we create 2.5 quintillion bytes of data - so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data."

Key factors and characteristics. The gist of the matter
What are the main characteristics of data in Big Data?
- Volume (data at rest): terabytes to exabytes of existing data to process
- Velocity (data in motion): streaming data, milliseconds to seconds to respond, e.g. in high frequency trading
- Variety (data in many forms): structured, unstructured, text, multimedia
- Veracity (data in doubt): uncertainty due to data inconsistency, incompleteness, ambiguities, latency, deception, approximations
The 3Vs of Big Data [Gardner 2001] [adapted from Data Science Center]

The crux of the matter. Big Data induces "intelligence": from Big Data to Smart Data
The Big Data analysis pipeline
- transfers big data (many ) into smart data (meaningful data)
- accumulates intelligence from information fragments
- is a pipeline of aggregating (artificial) intelligence.
Pipeline stages: Acquisition/Recording, Extraction/Cleaning/Annotation, Integration/Aggregation/Representation, Analysis/Modeling, Interpretation

Further characteristics. Big Data is distributed
Generated by a distributed world: in multiple domains, applications and users generate data that is (partially) Big Data. Hence, Big Data is generated by a distributed world.
Stored in distributed file systems: Big Data is structured and unstructured (variety) and its size is enormous (volume). Distributed file systems are required to reliably scale to petabytes of data and thousands of machines.
Analyzed by distributed computing: the requirements of Big Data analytics regarding volume and velocity can only be satisfied by distributed computing solutions.

Facts about the infrastructure of Big Data. About servers and storage
Storing and processing Big Data, a matter of servers and cores
"We have something over a million servers in our datacenter infrastructure. Google is bigger, Amazon is a little bit smaller." Steve Ballmer, Microsoft's 2013 Worldwide Partner Conference, 2013
An unofficial estimate puts the number of Google servers at more than 2 million. [Pearn 2012]
It is estimated that Google owns more than 2% of all the world's servers. [intac 2010]
For Amazon Web Services it is estimated that Amazon has at least 454,400 servers in seven data center hubs around the globe. [Miller 2012]

The way so far and beyond. Two worlds coming together
[Figure: two worlds, Distributed Systems (DS, "FAST": distributed sources, distributed storage, distributed computing, real-time capability, autonomy) and Artificial Intelligence (AI, "SMART": social media data, natural language analysis, prediction, smart data), meet through the Big Data characteristics volume, velocity, variety and veracity in Distributed Artificial Intelligence]

Evolution of the term Big Data. Is there a definitive date?
In parallel: the term "software" was established in 1958 in an article in The American Mathematical Monthly, written by John Tukey.
What about the term Big Data? Is there such a definitive date?
"We call this the problem of big data." The term was first mentioned in a research article in 1997 [Cox and Ellsworth 1997]
Google Search Trends for "Big Data": "Google Trends for Big Data shows an explosive growth in popularity of this term, starting around 2011." Gregory Piatetsky-Shapiro, Editor at KDnuggets

Evolution of Big Data as a research topic. Taking a look at the published papers
Big Data in research emerged around 2008 [Halevi and Moed 2012]
1. First appearance of the term in 1970 in an article on atmospheric and oceanic soundings
2. Until 2000 led by computer engineering, but also in areas such as building materials, electric and telecommunication
3. Since 2000 the field is led by computer science, followed by engineering and mathematics
[Chart: number of Big Data papers per year, 1970s to 2010s, vertical axis from 0 to 120]

Evolution of Big Data as a research topic and the related disciplines
Big Data research is addressed by multiple disciplines [Halevi and Moed 2012]
1. Top subject area in Big Data research is computer science
2. Other disciplines investigate the topic (like engineering, mathematics, )
3. Some areas expected to be evident show no significant growth (like chemistry, energy and humanities)
In fact, there is a growing interest in the development of infrastructure for e-science for humanities
Number of papers by subject area:
- Computer Science: 171
- Engineering: 75
- Mathematics: 33
- Business, Management and Accounting: 26
- Physics and Astronomy: 23
- Biochemistry, Genetics and Molecular Biology: 19
- Social Science: 18
- Materials Science: 15
- Medicine: 14
- Decision Sciences: 13
- Multidisciplinary: 13
- Arts and Humanities: 11

Evolution of Big Data in banking and financial markets. How banking and financial market players see Big Data
"The industry [financial institutions] has been analyzing structured information for many years, but the new growth now is in unstructured data." [Andy Hirst, senior director of Industry Marketing for SAP, at the International SAP Conference for Financial Services in July, 2013]
"The faster a bank can analyze data, the better the predictive value." [Bryan Yurcan, associate editor for Bank Systems and Technology, 2013]
[Chart 1: Creating a competitive advantage for financial markets firms, banking and financial markets vs. global; values shown per year: 2010: 36%, 37%; 2011: 58%, 69%; 2012: 63%, 71%]
[Chart 2: Big data activities in banking and financial markets (banking and financial markets / global): pilot and implementation of big data activities 27% / 28%; planning big data activities 47% / 47%; have not begun big data activities 26% / 24%]
[Analytics: The real-world use of big data in financial services, IBM, 2012]

Evolution of Big Data in banking and financial markets. Use cases of Big Data in banking and financial markets
1. Fraud detection: "By 2016, 25 percent of large global companies will have adopted big data analytics for at least one security or fraud detection use case and will achieve a positive return on investment within the first six months of implementation." [Gartner Business Intelligence & Analytics Summit 2014]
2. Compliance / monitoring: Dodd-Frank Act, Solvency II and EMIR define new requirements regarding documentation and monitoring. New deal monitoring systems emerge that are based on Big Data technology.
3. Customer segmentation: "What products are they most likely to be interested in? How can they be persuaded toward the right product for both the Financial Institution and themselves? [ ] All of these relate to customer segmentation, but a much more dynamic way to segment customers than have historically been employed." [Oracle, Big Data Analytics: Financial Services Industry Use Cases, 2012]
4. Risk and trading analytics
Use cases are a summarization of [SAP AG, Top 5 Big Data Uses Cases in Banking and Financial Services, 2014] and [Ruchi Verma and Sathyan R. Mani, Infosys, Use of Big Data Technologies in Capital Markets, 2012]
"the term big data [ ] made its way into compliance, internal audit and fraud risk management-related publications. [ ] 72% of respondents believe that emerging big data technologies can play a key role in fraud prevention and detection. Yet only 7% of respondents are aware of any specific big data technologies, and only 2% of respondents are actually using them." [Ernst & Young, Big risks require big data thinking, Global Forensic Data Analytics Survey 2014]

Evolution of Big Data in banking and financial markets. Sense & respond systems using e.g. Twitter tweets
Customer segmentation
The idea of segmentation: "One of the essential topics taught in introductory marketing courses is the concept of market segmentation, which is the division of a market into groups of consumers that share one or more characteristics." [Steve Offsey, CEO MarketBuildr, 2012]
Segmentation in the age of Big Data: understanding the customer's DNA
- From B2C segmentation (geographic, demographic, ) to more complex segmentation models
- New types of segments can be derived due to new types of data (activity-based, social network profiles, social influence and sentiment data)
- Resulting in dynamic micro-segments (identified by data mining and artificial intelligence algorithms)
"In fact, we believe that a combination of psychology and data science is the only way for marketing leaders to unlock value from insights that are unlikely to be found purely through data mining."
[Punchh Launches Big Data Customer Segmentation 2013] [Big Data: from mining to meaning, Sandra Pickering]
"There's much buzz around big data everywhere. But the big problem is that until now, big data analytics has mostly been about tools for large companies with big budgets."

Evolution of Big Data in banking and financial markets. Sense & respond systems using Twitter tweets
Trading analytics
Changing trading strategies: "Capital markets have evolved from simple strategies, like 1980s-paired models, to the intricate gaming strategies of today. Trading strategies have started including unstructured data." [Ruchi Verma and Sathyan R. Mani, Infosys, Use of Big Data Technologies in Capital Markets, 2012]
Example: evolution of sense & response systems to handle responsiveness
Social media meets financial services: "Dataminr transforms the Twitter stream into actionable alerts, identifying the most relevant information in real-time for clients in Finance, News and the Public Sector." [dataminr.com, 2014]
"On November 11, 2013, a few minutes after 8 a.m. EST, news leaked out from a Canadian newspaper that Blackberry's $4.7 billion buyout had collapsed. Wall Street wouldn't find out for a full 180 seconds, when the newswires picked up the report in real time." [Brian O'Connell, Can Tweets And Facebook Posts Predict Stock Behavior?, 2014]
"On March 8, a Royal Caribbean cruise ship arrived in Port Everglades, Florida, with 105 passengers and three crew members sick with norovirus. When that news broke, it sent Royal Caribbean Cruises Ltd.'s share prices tumbling by 2.9%. But Dataminr clients had the news 48 minutes earlier." [Stan Alcorn, Twitter Can Predict The Stock Market, If You're Reading The Right Tweets, 2013]

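Challenge: Scale. Data-parallel models: Flynn classification of distributed systems
This is all about parallelization and distributed computation.
Two questions define the classes:
- Do the CPUs apply the same instruction to all data, or different ones? (single instruction vs. multiple instruction)
- Do all CPUs have their own storage, or are they sharing it? (single data vs. multiple data)
Resulting classes:
- Single instruction, single data: SISD (regular desktop computer; one single CPU; not a distributed system!)
- Multiple instruction, single data: MISD
- Single instruction, multiple data: SIMD
- Multiple instruction, multiple data: MIMD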

Challenge: Scale. MapReduce, the first major data-processing paradigm
Extremely high degree of parallelization possible
Stages: Input, Map, Summarize/Group/Share, Reduce, Result
Example:
Input (parallel handling of several documents):
  Doc 1: word1 word2 word2 word3
  Doc 2: word2 word2 word1 word3
Map (word recognition):
  Doc 1: (word1, 1) (word2, 1) (word2, 1) (word3, 1)
  Doc 2: (word2, 1) (word2, 1) (word1, 1) (word3, 1)
Summarize, Group, Share (re-arrangement of results by words; before: by documents):
  (word1, (1,1)) (word2, (1,1,1,1)) (word3, (1,1))
Reduce (summarizing all findings by words):
  (word1, 2) (word2, 4) (word3, 2)
Result: word frequency of the different words in the collection of documents
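The word-count example on this slide can be sketched in a few lines of Python. This is a minimal single-process illustration of the three stages, not how Hadoop or Google's implementation is written; the function names `map_phase`, `group_by_word` and `reduce_phase` are made up for this sketch. In a real cluster, each `map_phase` call and each `reduce_phase` call could run on a different machine.

```python
from collections import defaultdict

def map_phase(doc):
    """Map: emit a (word, 1) pair for every word found in one document."""
    return [(word, 1) for word in doc.split()]

def group_by_word(pairs):
    """Group: re-arrange the emitted pairs by word instead of by document."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped

def reduce_phase(word, counts):
    """Reduce: summarize all findings for one word."""
    return word, sum(counts)

# The two documents from the slide.
docs = ["word1 word2 word2 word3", "word2 word2 word1 word3"]

# Each document is mapped independently (this is the parallelizable part).
mapped = [pair for doc in docs for pair in map_phase(doc)]
grouped = group_by_word(mapped)
result = dict(reduce_phase(word, counts) for word, counts in grouped.items())

print(result)  # {'word1': 2, 'word2': 4, 'word3': 2}
```

The only step that needs data from all documents at once is the grouping; map and reduce work on independent pieces, which is where the high degree of parallelization comes from.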

Challenge: Scale. The Google story: data analysis beyond MapReduce [M. Braun, TU Berlin, 2013]
- The Google File System, published by Google in 2003: a large distributed file system; files are split into chunks and stored in a redundant fashion
- MapReduce, published by Google in 2004: THE distributed data-processing model, highly parallelizable
- Bigtable, published by Google in 2006: high-performance NoSQL database incl. timestamps, thus keeping old versions (history)
- Percolator, published by Google in 2010: describes how the web search index is kept up to date, acting on top of Bigtable
- Pregel, published by Google in 2010: mining graph data; a system for large-scale graph processing
- Dremel, published by Google in 2010: basis for online visualizations; acts on JSON objects instead of tables with fixed fields; core element of BigQuery

Challenge: Scale. Dremel: fast analysis of nested data [Tech Mortal, 2013]
Speed matters most!
Background: for large amounts of data, batch processing is becoming slower and slower. Alternatively, dialogue-based processing structures are desired which focus the search on the relevant part of the data.
GOAL: real-time interactive analysis of massive datasets; queries with response times below 20 minutes
METHOD:
- columnar storage for fast data scanning
- tree architecture for dispatching queries and aggregating results across huge computer clusters
- SQL-like queries (more realistic for a speed increase)
TODAY: Dremel is the core element of Google's BigQuery engine, realized as SaaS in a cloud. Google BigQuery allows users to conduct big data analysis with no need to operate a data center.
[Figure: nested data structure with records r1 and r2 and fields A to E; from record-oriented back to column-oriented storage]
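The idea behind "columnar storage for fast data scanning" can be illustrated with a small Python sketch. The records and field names below are invented for illustration, and the sketch only covers flat records; Dremel's actual encoding of nested data (which additionally stores repetition and definition levels) is deliberately left out.

```python
# Record-oriented layout: one dictionary per record (hypothetical sample data).
records = [
    {"user": "a", "country": "NL", "amount": 10},
    {"user": "b", "country": "DE", "amount": 25},
    {"user": "c", "country": "NL", "amount": 5},
]

def to_columns(rows):
    """Turn a list of records into one list of values per field."""
    columns = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns

columns = to_columns(records)

# A query such as "sum of all amounts" now scans a single column
# and never touches the 'user' or 'country' values.
total_amount = sum(columns["amount"])

print(columns["amount"])  # [10, 25, 5]
print(total_amount)       # 40
```

An aggregate over one field reads only that field's values, which is why column-oriented layouts help when queries touch few fields of very wide or very many records.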

Comparison of hypes. NoSQL database models
NoSQL: "not only SQL" (not: "no SQL")
Background: relational databases are efficient for many but small transactions, or for large transactions with rare writing processes. They are bad when it comes to many transactions AND many writing processes at the same time.
Data model: beyond the SQL standards based on relational databases, tree-like structures etc. come into place (no need to press data into structures where they do not fit); documents with nested structures instead of tables
Horizontal scalability: easy, since built on distributed architectures, simply by adding additional nodes
Famous examples: Google BigTable, Amazon Dynamo, several open source products such as MongoDB
Media types: text, binary large objects (picture, video, audio), nested structures such as JSON objects, YAML etc.
Improvements: flexible schema and automatic indexes, and: optimized for full text search
History: development since 1998 (older name: document-oriented databases)
Current development: standard SQL databases are integrating additional NoSQL features
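To make "documents with nested structures instead of tables" concrete, here is a small Python sketch that contrasts one nested document with the flat rows a relational schema would need. The customer record, its field names and the `flatten` helper are hypothetical and not tied to any particular database product.

```python
import json

# One nested document, as a document store might hold it (hypothetical data).
customer_doc = {
    "_id": "c-1",
    "name": "Example Customer",
    "accounts": [
        {"iban": "XX00-0001", "transactions": [{"amount": 120.0}, {"amount": -40.0}]},
        {"iban": "XX00-0002", "transactions": []},
    ],
}

def flatten(doc):
    """Press the nested document into flat (customer, account, amount) rows."""
    rows = []
    for account in doc["accounts"]:
        for tx in account["transactions"]:
            rows.append((doc["_id"], account["iban"], tx["amount"]))
    return rows

rows = flatten(customer_doc)

# The document survives a JSON round trip unchanged, nesting included.
round_trip = json.loads(json.dumps(customer_doc))

print(rows)  # [('c-1', 'XX00-0001', 120.0), ('c-1', 'XX00-0001', -40.0)]
```

Note that the account without transactions disappears from the flat rows; a relational design would need separate customer, account and transaction tables plus joins to keep it, whereas the document keeps the whole structure in one place.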

Triumph of Big Data computing. Natural language processing: IBM's DeepQA and Watson
Artificial intelligence meets Big Data
Problem: the open-domain question-answering problem
Domains: information retrieval, natural language processing, knowledge representation, reasoning, machine learning, computer-human interfaces
"X ran this?" demo: "If leadership is an art then surely Jack Welch has proved himself a master painter during his tenure at GE."
Background: "Consider, for example, the Computer in Star Trek. Taken to its ultimate form, broad and accurate open-domain question answering may well represent a crowning achievement for the field of Artificial Intelligence (AI)." [IBM 2011]
Solution: DeepQA, a massively parallel probabilistic evidence-based architecture [Ferrucci et al. 2010]
- 3 years of effort
- 20 researchers and software engineers
- operated within the winner's cloud
- 3 seconds response time

Technology timeline. The early years (1996-2004)
Data eras shown: web data, social data
Technologies on the timeline: Berkeley DB, Apache Lucene, Tokyo Cabinet, Lucene OpenSource, Nutch, Apache Solr, Google Search, Google FileSystem, Google MapReduce
Milestones:
- Larry Page and Sergey Brin begin their research project at Stanford (later: Google)
- Doug Cutting writes Lucene (key component of Nutch, Solr and Elasticsearch), an open-source search framework
- Larry Page and Sergey Brin present the large-scale hypertextual web search engine Google
- Doug Cutting and Mike Cafarella present Nutch (search engine, formerly part of Lucene)
- Google releases a paper about applying MapReduce on large data clusters
Legend: red box: open source; blue box: Google. Categories: data handling / storage, search, high-level language, distributed computation
Adapted from [Outliers 2013]

Technology timeline. Today (since 2005)
Data era shown: mobile data
Technologies on the timeline (2005-2013): Apache Hadoop, Apache Pig, MongoDB, Apache Hive, Disco Project, Elasticsearch, Apache Storm, DynamoDB, OrientDB, CouchDB, Google BigTable, Apache HBase, Redis, Apache Spark, Apache Cassandra, Google Dremel
Milestones:
- Doug Cutting and Mike Cafarella write Hadoop, a framework derived from Google's MapReduce
- Pig, originated by Yahoo, became an open-source project of the Apache Foundation
- Google releases a paper (Melnik et al.) about applying Dremel and BigQuery
- Apache Storm, an event processor and distributed computation framework (Clojure programming language), is released; it provides similar functionality to MapReduce
Legend: red box: open source; blue box: Google. Categories: data handling / storage, search, high-level language, distributed computation
Adapted from [Outliers 2013]

A look ahead. About the potential of Big Data in social science
NSF 2011-2016 Strategic Plan: "The revolution in information and communications technologies is another major factor influencing the conduct of 21st century research. New cyber tools for collecting, analyzing, communicating, and storing information are transforming the conduct of research and learning. One aspect of the information technology revolution is the data deluge, shorthand for the emergence of massive amounts of data and the changing capacity of scientists and engineers to maintain and analyze it."
The new availability of data presents a huge potential for researchers in social science.
Peter Doorn, director at Data Archiving and Network Services: "In social science research, there is a great tradition of survey methodology with people doing interviews about all kinds of ideas people may have. However, a new approach is to do things like a sentiment analysis on Twitter posts, for example. This is a totally new way of getting knowledge about what is going on in society."

A paradigm shift to computational social science. Big Data opens new opportunities for humanities
Big Data can help to reduce Runkel & McGrath's three-horned dilemma [Chang et al. 2013]
The three horns of Runkel/McGrath's framework are realism, generality and precision. Because of the data collection limitations of the traditional data collection methods, no method can be general, realistic and precise all at the same time. Research methodologies are dilemmatic because of the nature of the data collection performed in each research approach. [Kaufmann & Wood 2003]
[Figure: Runkel/McGrath framework with the axes obtrusive vs. unobtrusive research operations and universal vs. particular behavior systems; methods shown: judgement tasks, field experiments, formal theory, computer simulations; annotations: "Control is more fully achieved with realism and generality", "Little control is given up outside the lab", "Lower in costs", "Much easier to do now", "Realism is supported by control, generality more fully than before"; picture adapted from [Chang, Kauffman and Kwon, 2013]]
- Overcoming scholastic differences between qualitative and quantitative research methods
- Enhances Grounded Theory
"(We) should stop acting as if our goal is to author extremely elegant theories, and instead embrace complexity and make use of the best ally we have: the unreasonable effectiveness of data." [Halevy et al. 2009]

A bit of a jump into the deep end
Predicting human behavior: Transparent consumers

How Target figured out a teen girl was pregnant before her father did

Unique Target ID: each interaction with the retailer is assigned to that ID.
Customer profiles: clustering customers into groups, for example to identify disruptions in life (e.g. weddings, job changes and pregnancy).
Andrew Pole, statistician working for Target: Pole identified about 25 products that allowed him to assign each customer a pregnancy prediction score and an estimated due date.
Group of pregnant customers: targeted with a coupon campaign.
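
The scoring step can be illustrated with a minimal sketch: each indicator product found in a customer's purchase history adds a weight, and a logistic function turns the sum into a score between 0 and 1. The slide does not say how Target's actual model works, so the product names, weights, bias and customer IDs below are purely hypothetical.

```python
import math

# Hypothetical indicator products and weights (assumption); the slide only
# says that about 25 products were used, not which ones or how.
WEIGHTS = {
    "product_a": 1.2,
    "product_b": 0.9,
    "product_c": 0.6,
}
BIAS = -2.0  # hypothetical baseline log-odds when no indicator product was bought


def prediction_score(purchases: set) -> float:
    """Logistic score in (0, 1) from the indicator products a customer bought."""
    z = BIAS + sum(weight for product, weight in WEIGHTS.items() if product in purchases)
    return 1.0 / (1.0 + math.exp(-z))


# Purchase histories keyed by a unique customer ID (made-up data).
history_by_customer_id = {
    "id-001": {"product_a", "product_b", "product_c"},
    "id-002": {"bread", "milk"},
}

scores = {cid: prediction_score(items) for cid, items in history_by_customer_id.items()}
for cid, score in scores.items():
    print(cid, round(score, 3))
```

Customers whose score exceeds a chosen threshold would form the group addressed by a campaign; where that threshold sits is a business and ethics decision, not a statistical one.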

A look ahead
"Robot Recruiting": Don't call us, we'll call you

In more and more companies, computer algorithms are part of the hiring of new workers [Handelsblatt 03/2014]. CV data are combined with success data of the particular company or field.
Germany: about 40%
USA: > 90%

"Great developers are everywhere, and Gild can prove it." (on www.gild.com)

Fairness?
Humans and computers have different mental models.
Software-based selection is incapable of analyzing true motivation, extraordinary engagement, etc.
Talents might be overlooked or lost.

But:
Selection shows a higher degree of equal opportunity regarding gender, age, culture, etc.
Selection shows a higher degree of tolerance with respect to disruptions in the CV.

Multi- and interdisciplinary challenge
The Sexiest Job of the 21st Century

Transforming data into business value: the data scientist.

"It's a high-ranking professional with the training and curiosity to make discoveries in the world of big data." [Davenport and Patil 2012]

"A data scientist is somebody who is inquisitive, who can stare at data and spot trends. It's almost like a Renaissance individual who really wants to learn and bring change to an organization." [IBM 2014]

Towards artificial intelligence: ... and with a sound knowledge of machine learning, natural language processing, etc.

Demand for people with deep expertise in data analysis [McDonnell 2011], in thousands:
2008 employment: 156
Forecast of graduates: +161
Adjustments (reemployment & attrition): -32
Projected 2018 supply: 285
Demand: 425-475 (50-60% gap)
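
The supply figures on this slide can be checked with a few lines of arithmetic. This is only a sketch: reading "425-475" as the projected demand (in thousands) is an assumption based on the slide title, and the variable names are invented here.

```python
# All figures in thousands, as listed on the slide.
employment_2008 = 156
forecast_graduates = 161
adjustments = -32  # reemployment & attrition

projected_supply_2018 = employment_2008 + forecast_graduates + adjustments

# Assumption: the range 425-475 is the projected demand.
demand_low, demand_high = 425, 475
gap_low = demand_low - projected_supply_2018
gap_high = demand_high - projected_supply_2018

print("projected 2018 supply:", projected_supply_2018)
print("absolute gap:", gap_low, "to", gap_high)
```

Under that reading, the supply adds up to the 285 thousand shown, and the shortfall is between 140 and 190 thousand people.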

Facts about the digital universe
Some facts on data in the digital universe

Fact 1: From 2005 to 2020, the digital universe will grow by a factor of 300 (more than 5,200 gigabytes for every man, woman, and child). From now until 2020, the digital universe will about double every two years. [IDC 2012]

Fact 2: The investment in spending on IT considered the "infrastructure" of the digital universe will grow by 40% between 2012 and 2020. As a result, the investment per GB will drop from $2.00 to $0.20. [IDC 2012]

Fact 3: A majority of the information in the digital universe, 68% in 2012, is created by consumers [ ]. Yet enterprises have liability or responsibility for nearly 80%. They deal with issues of copyright, privacy, and compliance. [IDC 2012]

Fact 4: Only a tiny fraction of the digital universe has been explored for analytic value. By 2020, as much as 33% of the digital universe will contain information that might be valuable if analyzed. [IDC 2012]

[Figure: The Digital Universe [IDC 2012]. Consumer-generated (1,934 EB), Enterprise Touch (2,225 EB), Overlap (1,342 EB); Digital Universe vs. useful if tagged & analyzed.]
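
The growth figures in Facts 1 and 2 can be cross-checked with a short calculation. This is a sketch under stated assumptions: "from now" is taken to mean 2012 (the year of the cited report), growth is treated as smoothly exponential, and the variable names are invented here.

```python
import math

# Fact 1: growth by a factor of 300 between 2005 and 2020.
growth_factor = 300
years = 2020 - 2005

# Implied constant yearly growth factor and the matching doubling time.
annual_growth = growth_factor ** (1 / years)
doubling_time = math.log(2) / math.log(annual_growth)

# Doubling every two years from 2012 (assumed "now") until 2020.
factor_2012_2020 = 2 ** ((2020 - 2012) / 2)

# Fact 2: investment per GB falls from $2.00 to $0.20.
cost_ratio = 2.00 / 0.20

print(f"yearly growth factor: {annual_growth:.2f}")
print(f"doubling time: {doubling_time:.2f} years")
print(f"growth 2012-2020 at one doubling per two years: x{factor_2012_2020:.0f}")
print(f"cost per GB falls by a factor of {cost_ratio:.0f}")
```

A factor of 300 over 15 years corresponds to a doubling time of roughly 1.8 years, which is consistent with the statement that the digital universe "will about double every two years".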

The fourth industrial (r)evolution
Big Data meets Industry 4.0: Everybody & everything is networked

"The first three industrial revolutions came about as a result of mechanisation, electricity and IT. The introduction of the Internet of Things is ushering in a fourth industrial revolution. Industry 4.0 will address and solve some of the challenges facing the world today such as resource and energy efficiency, urban production and demographic change." Henning Kagermann et al., acatech, 2013

[Images: Vision of the Wireless Next Generation System (WiNGS) Lab at the University of Texas at San Antonio, Dr. Kelley; Weidmüller, Vision 2020 - Industrial Revolution 4.0: intelligently networked, self-controlling manufacturing systems]

Timeline (local to global):
Around 1750: 1st industrial revolution. Mechanical production systematically using the power of water and steam.
Around 1900: Power revolution. Centralized electric power infrastructure; mass production by division of labor.
Around 1970: Digital revolution. Digital computing and communication technology, enhancing systems intelligence.
Today: Information revolution. Everybody and everything is networked; networked information as a "huge brain".

Cyber-Physical Systems
Towards complex and networked social-technical systems

Let's have a look:
Communication, Consumer, Energy, Infrastructure, Health Care, Manufacturing, Military, Robotics, Transportation

[Images: [CAR2CAR, 2011] and [ConnectSafe, 2011]]

The fourth industrial (r)evolution
Not restricted to industry: Cyber-Physical Systems in all areas

Back to: "The earth converted into a huge brain" (Tesla, 1926)

Integrating complex information from multiple heterogeneous sources opens multiple possibilities for optimization, e.g. energy consumption, security services and rescue services, as well as increasing the quality of life.

Building automation, smart metering, smart grid, room automation, smart environment, and more

Outline

I. A Bit of a Jump into the Deep End
II. In Search of a Definition: From Google Search and the 3Vs of Big Data, Finally Ending up with Distributed Systems and Artificial Intelligence
III. Expansion: Big Facts and High Figures, and Applications in the Banking and Financial Markets
IV. Methods and Concepts behind the Trend: About Challenges, their Solutions and Open Issues, and the Players in the Game
V. A Look Ahead: From Big Data to Cyber Physical Systems, the Internet of Things and Industry 4.0
VI. Summary

Summary
Big Data: Underneath the hype

[Image: Google Car 2012]

So what exactly is the real revolution? It's not data. It's being data-driven.
Collect and organize data and use tools to extract information and gain insights.
Quickly and effectively test the derived hypothesis to prove cause-and-effect.

The real data revolution won't be a sugar-coated miracle pill that anyone can adopt simply by buying some software, hiring a data scientist, and a cloud full of data. Many organizations will be usurped by new competitors who grow up natively with this new worldview. [Brinker 2013]

Collect from distributed sources and organize in distributed systems:
Social media and crowd data collection
Distributed storage, querying and analysis

Derive and test hypothesis and analyze for insights:
Hypothesis/evidence scoring based on evidence models
Sense and response
Predictive analysis

Distributed systems, technically and socially: distributed artificial intelligence
Artificial intelligence: smart computer systems

Thank you!

Univ.-Prof. Dr. rer. nat. Sabina Jeschke
Head of Institute Cluster IMA/ZLW & IfU
phone: +49 241-80-91110
sabina.jeschke@ima-zlw-ifu.rwth-aachen.de

Co-authored by:
Dr.-Ing. Tobias Meisen
Institute Cluster IMA/ZLW & IfU
phone: +49 241-80-91139
tobias.meisen@ima-zlw-ifu.rwth-aachen.de

www.ima-zlw-ifu.rwth-aachen.de

Prof. Dr. rer. nat. Sabina Jeschke

1968: Born in Kungälv/Sweden
1991: Birth of son Björn-Marcel
1991-1997: Studies of Physics, Mathematics, Computer Sciences, TU Berlin
1994: NASA Ames Research Center, Moffett Field, CA/USA
10/1994: Fellowship, Studienstiftung des Deutschen Volkes
1997: Diploma in Physics
1997-2000: Research Fellow, TU Berlin, Institute for Mathematics
2000-2001: Lecturer, Georgia Institute of Technology, GA/USA
2001-2004: Project leadership, TU Berlin, Institute for Mathematics
04/2004: Ph.D. (Dr. rer. nat.), TU Berlin, in the field of Computer Sciences
from 2004: Set-up and leadership of the Multimedia-Center at the TU Berlin
2005-2007: Junior Professor "New Media in Mathematics & Sciences" & Director of the Media-center MuLF, TU Berlin
2007-2009: Univ.-Professor, Institute for IT Service Technologies (IITS) & Director of the Computer Center (RUS), Department of Electrical Engineering, University of Stuttgart
since 06/2009: Univ.-Professor, Institute for Information Management in Mechanical Engineering (IMA) & Center for Learning and Knowledge Management (ZLW) & Institute for Management Cybernetics (IfU), RWTH Aachen University
since 10/2011: Vice dean of the department of Mechanical Engineering, RWTH Aachen University
since 03/2012: Chairwoman VDI Aachen