
Principles of E-Commerce I: Business and Technology (PoE1). Focus: Big Data Platforms. Prof. Roberto V. Zicari, with the support of Todor Ivanov, Marten Rosselli and Dr. Karsten Tolle. SS 2015

Principles of E-Commerce I (PoE1), Focus: Big Data Platforms. Responsible: Prof. Roberto V. Zicari, with the support of Todor Ivanov, Marten Rosselli and Dr. Karsten Tolle. Time and location: Tuesday 10:15-11:45, SR 11 (Informatikgebäude); Wednesday 10:15-11:45, SR 307 (Informatikgebäude). Goethe University Frankfurt, Institute for Computer Science, DBIS.

Basic information. Webpage of the Frankfurt Big Data Lab: http://www.bigdata.uni-frankfurt.de/principles-e-commerce-ss2015/ Homepage DBIS: http://www.dbis.cs.uni-frankfurt.de E-Mail: poe@dbis.cs.uni-frankfurt.de Attention: we will try to announce changes and news on the webpage, so please have a look before each lecture.

Hands-on. This is a hands-on course; the exercise and lecture parts are mixed as needed. Please make sure you bring at least one notebook per two persons to each course slot. (If this is a problem, send an email to poe@dbis.cs.uni-frankfurt.de.)

How to get the CPs (6) and the final score. Each participant needs to complete four practical assignments and present the results; details will follow. Registration: with the first assignment, we will collect your data (name, matriculation number and degree programme) to register you for this course.

Schedule (preliminary, please check the web site!)
Lecturers: Prof. Dott. Ing. Roberto V. Zicari (Intro), Todor Ivanov (Hadoop), Dr. Karsten Tolle (Graph DBs), Marten Rosselli (NoSQL), students (presentations).
Week 1: Tue 14.04.2015; Wed 15.04.2015 - Intro to Big Data
Week 2: Tue 21.04.2015 - Intro to Hadoop 1: HDFS & MapReduce; Wed 22.04.2015 - Intro to Hadoop 2: HDFS & MapReduce
Week 3: Tue 28.04.2015 - Hadoop Ecosystem 1; Wed 29.04.2015 - Hadoop Ecosystem 2
Week 4: Tue 05.05.2015 - Data Acquisition 1; Wed 06.05.2015 - Data Acquisition 2
Week 5: Tue 12.05.2015 - Graphs, GraphDBs; Wed 13.05.2015 - Semantic Web, LOD
Week 6: Tue 19.05.2015 - Intro to Pig 1; Wed 20.05.2015 - Intro to Pig 2
Week 7: Tue 26.05.2015 - Student Presentations 1; Wed 27.05.2015 - Student Presentations 2
Week 8: Tue 02.06.2015 - Advanced Pig 1; Wed 03.06.2015 - Intro to Hive 1
Week 9: Tue 09.06.2015 - Intro to Hive 2; Wed 10.06.2015 - Advanced Hive 1
Week 10: Tue 16.06.2015 - NoSQL; Wed 17.06.2015 - NoSQL
Week 11: Tue 23.06.2015 - NoSQL Exercise; Wed 24.06.2015 - NoSQL Exercise
Week 12: Tue 30.06.2015 - Impala; Wed 01.07.2015 - New Big Data Technologies: Spark
Week 13: Tue 07.07.2015 - NoSQL; Wed 08.07.2015 - NoSQL
Week 14: Tue 14.07.2015 - Student Presentations 3; Wed 15.07.2015 - Student Presentations 4

Ringvorlesung, focus: Big Data, Internet of Things and Data Science. A series of 10 guest lectures, open to all. Course start/end: Thursday, 23.04.2015 to Thursday, 16.07.2015. Time: every Thursday, 14:15-15:45. Location: Robert-Mayer-Straße 11-15, Room SR 307 (Informatikgebäude). Webpage: http://www.bigdata.uni-frankfurt.de/soft-skills-entrepreneurship-m-ssk-bsos/ Frankfurt Big Data Lab: http://www.bigdata.uni-frankfurt.de

Ringvorlesung schedule:
30.04.2015 - Prof. Roberto V. Zicari, Frankfurt Big Data Lab, Goethe University Frankfurt: Big Data: A Data Driven Society?
07.05.2015 - Prof. Nikos Korfiatis, Assistant Professor of Business Analytics, Norwich Business School, University of East Anglia, UK: Big Data and Regulation
21.05.2015 - Dr. Alexander Zeier, Managing Director, Globally for In-Memory Solutions at Accenture: In-Memory Technologies and Applications: S/4 HANA
28.05.2015 - Jörg Besier, Managing Director at Accenture, Digital Delivery Lead ASG: Towards a data-driven economy. How Big Data fuels the digital economy
11.06.2015 - Klaas Wilhelm Bollhoefer, Chief Data Scientist, The unbelievable Machine Company: Introduction to Data Science
18.06.2015 - Prof. Dr. Katharina Morik, TU Dortmund University: Big Data Analytics in Astrophysics
25.06.2015 - Prof. Hans Uszkoreit, Scientific Director, German Research Center for Artificial Intelligence (DFKI): Smart Data Web: Value chains for industrial applications
02.07.2015 - Matthew Eric Bassett, Director and Co-Founder of Gower Street Analytics, former Director of Data Science at NBCUniversal International, UK: Data Science and the future of the movie business
09.07.2015 - Prof. Dr. Christoph Schommer, University of Luxembourg: Algorithms for Data Privacy
16.07.2015 - Thomas Jarzombek, Member of the German Bundestag: Big Data and its challenges for today's politics

Big Data slogans. "Big Data: The next frontier for innovation, competition, and productivity" (McKinsey Global Institute). "Data is the new gold" (Open Data Initiative, European Commission, which aims at opening up Public Sector Information).

What Data? The term "Big Data" refers to large amounts of different types of data produced with high velocity from a high number of various types of sources. Handling today's highly variable and real-time datasets requires new tools and methods, such as powerful processors, software and algorithms. The term "Open Data" refers to a subset of data, namely data made freely available for re-use to everyone, for both commercial and non-commercial purposes. Linked Data is about using the Web to connect related data that wasn't previously linked, or using the Web to lower the barriers to linking data currently linked using other methods.

This is Big Data. Every day, 2.5 quintillion bytes (2.5 exabytes) of data are created. This data comes from digital pictures, videos, posts to social media sites, intelligent sensors, purchase transaction records and cell phone GPS signals, to name a few sources. In 2013, estimates reached 4 zettabytes of data generated worldwide (*). (*) Mary Meeker and Liang Yu, Internet Trends, Kleiner Perkins Caulfield Byers, 2013, http://www.slideshare.net/kleinerperkins/kpcb-internet-trends-2013.

Source: http://aci.info/2014/07/12/the-data-explosion-in-2014-minute-by-minute-infographic/

How Big is Big Data?
1 megabyte = 1,000,000 bytes = 10^6 bytes
1 gigabyte = 10^9 bytes
1 terabyte = 1,000,000,000,000 bytes = 10^12 bytes
1 petabyte = 1,000 terabytes (TB) = 10^15 bytes
1 exabyte = 10^18 bytes
1 zettabyte = 1,000,000,000,000,000,000,000 bytes = 10^21 bytes
Imagine that every person in the United States (320,590,000 people) took a digital photo every second of every day for over a month. All of those photos put together would equal about one zettabyte (*).
(*) BIG DATA: SEIZING OPPORTUNITIES, PRESERVING VALUES. Executive Office of the President, May 2014, The White House, Washington.
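As a back-of-the-envelope check of the photo illustration above, here is a small Python sketch. The assumed photo size of about 1 MB is our own assumption for illustration, not a figure from the White House report.

# Rough check of the zettabyte illustration above.
# Assumption (not from the source): an average digital photo is about 1 MB.
US_POPULATION = 320_590_000        # people, as quoted on the slide
PHOTO_SIZE_BYTES = 10**6           # assumed average photo size (1 MB)
SECONDS_PER_DAY = 24 * 60 * 60
ZETTABYTE = 10**21

def photo_volume_zb(days):
    """Total volume, in zettabytes, if every person takes one photo per second for `days` days."""
    return US_POPULATION * SECONDS_PER_DAY * days * PHOTO_SIZE_BYTES / ZETTABYTE

for days in (30, 40):
    print(days, "days:", round(photo_volume_zb(days), 2), "ZB")
# prints roughly 0.83 ZB after 30 days and 1.11 ZB after 40 days,
# i.e. about one zettabyte after "over a month".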

Another Definition of Big Data. "Big Data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze" (McKinsey Global Institute). Note that this definition is intentionally not tied to a fixed data size (data sets will keep growing) and varies by sector (ranging from a few dozen terabytes to multiple petabytes).

Examples of gigabyte-sized storage (source: Wikipedia). One hour of standard-definition (SDTV) video at 2.2 Mbit/s is approximately 1 GB. Seven minutes of high-definition (HDTV) video at 19.39 Mbit/s is approximately 1 GB. 114 minutes of uncompressed CD-quality audio at 1.4 Mbit/s is approximately 1 GB. A single-layer DVD-R can hold about 4.7 GB. A dual-layer Blu-ray Disc can hold about 50 GB.

Examples of the use of the terabyte (source: Wikipedia). Audio: one terabyte of audio recorded at CD quality contains approximately 2,000 hours of audio. Climate science: in 2010, the German Climate Computing Centre (DKRZ) was generating 10,000 TB of data per year. Video: released in 2009, the 3D animated film Monsters vs. Aliens used 100 TB of storage during development. The Hubble Space Telescope collected more than 45 terabytes of data in its first 20 years of observations. Historical Internet traffic: in 1993, total Internet traffic amounted to approximately 100 TB for the year. As of June 2008, Cisco Systems estimated Internet traffic at 160 TB/s (which, assumed to be constant, comes to 5 zettabytes for the year). In other words, the amount of Internet traffic per second in 2008 exceeded all of the Internet traffic in 1993.

Examples of the use of the petabyte (source: Wikipedia). Databases: Teradata Database 12 has a capacity of 50 petabytes of compressed data. Data mining: in August 2012, Facebook's Hadoop clusters included the largest single HDFS cluster known, with more than 100 PB of physical disk space in a single HDFS filesystem. Yahoo stores 2 petabytes of data on behavior. Telecommunications (usage): AT&T transfers about 30 petabytes of data through its networks each day. Internet: Google processed about 24 petabytes of data per day in 2009. Data storage systems: in August 2011, IBM was reported to have built the largest storage array ever, with a capacity of 120 petabytes.

Examples of the use of the petabyte, continued (source: Wikipedia). Photos: as of January 2013, Facebook users had uploaded over 240 billion photos, with 350 million new photos every day. For each uploaded photo, Facebook generates and stores four images of different sizes, which translates to a total of 960 billion images and an estimated 357 petabytes of storage. Music: one petabyte of average MP3-encoded songs (for mobile, roughly one megabyte per minute) would require 2,000 years to play. Games: World of Warcraft uses 1.3 petabytes of storage to maintain its game. Physics: experiments at the Large Hadron Collider produce about 15 petabytes of data per year. Climate science: the German Climate Computing Centre (DKRZ) has a storage capacity of 60 petabytes of climate data.

The Internet of Things. The Internet of Things is a term used to describe the ability of devices to communicate with each other using embedded sensors that are linked through wired and wireless networks. These devices could include your thermostat, your car, or a pill you swallow so the doctor can monitor the health of your digestive tract. These connected devices use the Internet to transmit, compile, and analyze data (*). (*) BIG DATA: SEIZING OPPORTUNITIES, PRESERVING VALUES. Executive Office of the President, May 2014, The White House, Washington.

What is Big Data supposed to create? Value (McKinsey Global Institute):
- Creating transparency
- Discovering needs, exposing variability, improving performance
- Segmenting customers
- Replacing or supporting human decision making with automated algorithms
- Innovating new business models, products and services

Figure source: http://jtonedm.com/2013/06/05/business-value-of-big-data-and-analytics-bda13/

The value of big data. Figure source: http://www.cmswire.com/cms/information-management/big-data-smart-data-and-the-fallacy-that-lies-between-017956.php#null

What is Data Science? (source: http://datascience.nyu.edu/what-is-data-science/) Data science involves using automated methods to analyze massive amounts of data and to extract knowledge from them. One way to consider data science is as an evolutionary step in interdisciplinary fields like business analysis that incorporate computer science, modeling, statistics, analytics, and mathematics.

Big Data Analytics (source: http://community.lithium.com/t5/science-of-social-blog/big-data-reduction-2-understanding-predictive-analytics/ba-p/79616). 1. Descriptive Analytics. The purpose of descriptive analytics is simply to summarize and tell you what happened. It is the simplest class of analytics that you can use to reduce big data into much smaller, consumable bites of information. Descriptive analytics computes statistics (counts, sums, averages, percentages, min, max and simple arithmetic) that summarize certain groupings or filtered versions of the data, typically simple counts of some events. These computations are mostly based on standard aggregate functions in databases.
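To make the idea concrete, here is a minimal Python sketch of descriptive analytics as plain grouping and aggregation over a handful of invented page-view events; a database would express the same thing with GROUP BY and aggregate functions.

# Descriptive analytics sketch: counts, sums and averages per group.
# The events are invented for illustration.
from collections import defaultdict
from statistics import mean

events = [
    {"country": "DE", "page": "/home", "duration_s": 12},
    {"country": "DE", "page": "/shop", "duration_s": 45},
    {"country": "US", "page": "/home", "duration_s": 8},
    {"country": "US", "page": "/home", "duration_s": 30},
]

durations_by_country = defaultdict(list)
for event in events:
    durations_by_country[event["country"]].append(event["duration_s"])

for country, durations in sorted(durations_by_country.items()):
    # the same aggregates a database GROUP BY would produce
    print(country, "visits:", len(durations),
          "total:", sum(durations), "avg:", round(mean(durations), 1))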

Big Data Analytics (source: http://community.lithium.com/t5/science-of-social-blog/big-data-reduction-2-understanding-predictive-analytics/ba-p/79616). 2. Predictive Analytics. The purpose of predictive analytics is NOT to tell you what will happen in the future; it cannot do that. In fact, no analytics can do that. Predictive analytics can only forecast what might happen in the future, because all predictive analytics are probabilistic in nature.

Examples of Non-Temporal Predictive Analytics (source: http://community.lithium.com/t5/science-of-social-blog/big-data-reduction-2-understanding-predictive-analytics/ba-p/79616). One example of non-temporal predictive analytics is a model that uses someone's existing social media activity data (data we have) to predict his or her potential to influence (data we don't have). Another well-known example of non-temporal predictive analytics in social analytics is sentiment analysis.
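As a toy sketch of the first example, the following Python code trains a simple probabilistic classifier on invented user-activity features and outputs a probability of being an influencer. It assumes scikit-learn is installed; the data and features are made up for illustration.

# Non-temporal predictive analytics sketch: predict influence from activity data.
# Invented training data; a real model would use far larger labelled data sets.
from sklearn.linear_model import LogisticRegression

# features per user: [posts_per_day, followers, avg_retweets]
X_train = [[1, 50, 0], [3, 200, 2], [10, 5000, 40], [7, 1200, 15], [0, 10, 0]]
y_train = [0, 0, 1, 1, 0]   # 1 = known influencer, 0 = not

model = LogisticRegression()
model.fit(X_train, y_train)

new_user = [[5, 800, 10]]
probability = model.predict_proba(new_user)[0][1]
print("Estimated probability of being an influencer:", round(probability, 2))
# Note: the output is a probability of what might happen, not a certainty.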

Big Data Analytics (source: http://community.lithium.com/t5/science-of-social-blog/big-data-reduction-3-from-descriptive-to-prescriptive/ba-p/81556). 3. Prescriptive Analytics. Prescriptive analytics not only predicts a possible future, it predicts multiple futures based on the decision maker's actions. A prescriptive model can be viewed as a combination of multiple predictive models running in parallel, one for each possible input action.

A predictive model must have two additional components in order to be prescriptive (source: http://community.lithium.com/t5/science-of-social-blog/big-data-reduction-3-from-descriptive-to-prescriptive/ba-p/81556). Actionable: the data consumers must be able to take actions based on the predicted outcome of the model. Feedback system: the model must have a feedback system that tracks the adjusted outcome based on the action taken. This means the predictive model must be smart enough to learn the complex relationship between the user's action and the adjusted outcome through the feedback data.
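A minimal sketch of this idea in Python: one (made-up) predictive model is evaluated per candidate action, and the action with the best predicted outcome is recommended. The scoring function and numbers are invented for illustration.

# Prescriptive analytics sketch: one prediction per possible action, pick the best.
def predicted_outcome(customer, action):
    """Hypothetical predictive model: expected revenue if we take `action`."""
    uplift = {"do_nothing": 1.00, "send_coupon": 1.15, "call_customer": 1.05}
    return customer["spend_last_month"] * uplift[action]

def prescribe(customer):
    actions = ["do_nothing", "send_coupon", "call_customer"]
    return max(actions, key=lambda action: predicted_outcome(customer, action))

customer = {"id": 42, "spend_last_month": 120.0}
print("Recommended action:", prescribe(customer))
# A real system would also feed the observed outcome back in (the feedback
# system above) so the per-action models keep improving.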

How will Big Data be used? Combining data together is where the real value lies for corporations: 90% corporate data and 10% social media data, with sensor data (e.g. from smart meters) only just beginning. Big data is a key basis of competition and growth for individual firms (McKinsey Global Institute).

Figure source: http://cdn.ttgtmedia.com/rms/onlineimages/bi_0814_page7_graphic1.png

Examples of Big Data use cases:
- Log analytics
- Fraud detection
- Social media and sentiment analysis
- Risk modeling and management

Figure source: http://www.loadedtech.com.au/blog/bid/156700/big-data-survey-brings-new-insights

Big Data can generate financial value (*) across sectors, e.g.:
- Health care
- Public sector administration
- Global personal location data
- Retail
- Manufacturing
(McKinsey Global Institute) (*) Note: but it could be more than that!

Limitations. There is a shortage of the talent organizations need to take advantage of big data: very few PhDs with knowledge of statistics, machine learning and data mining, and too few managers and analysts who can make decisions using insights from big data. Source: McKinsey Global Institute.

Smart Data? Big data provides the infrastructure for economically storing and processing unprecedented amounts of data. But undigested big data (e.g. terabytes of raw logs) and the technology required for it (e.g. Hadoop, Cassandra, etc.) are pretty much inaccessible to the average business person. There is a huge disconnect between what big data provides and what businesses need. Smart data is how you can fill the gap. Source: http://www.cmswire.com/cms/information-management/big-data-smart-data-and-the-fallacy-that-lies-between-017956.php#null

Figure source: http://tarrysingh.com/2014/07/fog-computing-happens-when-big-data-analytics-marries-internet-of-things/

Big Data: What are the consequences? "Any technological or social force that reaches down to affect the majority of society's members is bound to produce a number of controversial topics" (John Bittner, 1977). But what are the true consequences of a society being reshaped by systematically building on data analytics?

Big Data: Challenges
1. Data
2. Process
3. Management

Data Challenges. Volume: dealing with the sheer size of the data. In the year 2000, 800,000 petabytes (PB) of data were stored in the world (source: IBM); this is expected to reach 35 zettabytes (ZB) by 2020. Twitter generates 7+ terabytes (TB) of data every day, Facebook 10 TB. Scale and performance requirements strain conventional databases. Scalability has three aspects: data volume, hardware size, and concurrency.

Analytics Data Platforms for Big Data. Mike Carey (EDBT keynote, 2012) distinguishes: Big Data in the database world (early 1980s till now): parallel databases with a shared-nothing architecture, exploiting the declarative, set-oriented nature of relational queries for divide-and-conquer parallelism (e.g. Teradata), and re-implementations of relational databases (e.g. HP/Vertica, IBM/Netezza, Teradata/Aster Data, EMC/Greenplum). Big Data in the systems world (late 1990s): Apache Hadoop (inspired by Google GFS and MapReduce, with contributions from large Web companies, e.g. Yahoo! and Facebook), Google BigTable, and Amazon Dynamo.

Data Challenges. Variety: handling a multiplicity of types, sources and formats, such as sensors, smart devices and social collaboration technologies. Data is not only structured, but also raw, semi-structured and unstructured data from web pages, web log files (clickstream data), search indexes, e-mails, documents, sensor data, etc.

Structured Data: the Employee table
EmpNo  Ename  DeptNo  DeptName
100    Bob    10      Marketing
200    Bob    20      Purchasing
150    Peter  10      Marketing
170    Doug   20      Purchasing
105    John   10      Marketing

Clickstream Data. Clickstream data is an information trail a user leaves behind while visiting a website. It is typically captured in semi-structured website logs (sources: http://www.jafsoft.com/searchengines/log_sample.html and http://hortonworks.com):
fcrawler.looksmart.com - - [26/Apr/2000:00:00:12-0400] "GET /contacts.html HTTP/1.0" 200 4595 "-" "FAST-WebCrawler/2.1-pre2 (ashen@looksmart.net)"
fcrawler.looksmart.com - - [26/Apr/2000:00:17:19-0400] "GET /news/news.html HTTP/1.0" 200 16716 "-" "FAST-WebCrawler/2.1-pre2 (ashen@looksmart.net)"
ppp931.on.bellglobal.com - - [26/Apr/2000:00:16:12-0400] "GET /download/windows/asctab31.zip HTTP/1.0" 200 1540096 "http://www.htmlgoodies.com/downloads/freeware/webdevelopment/15.html" "Mozilla/4.7 [en]c-sympa (Win95; U)"
123.123.123.123 - - [26/Apr/2000:00:23:48-0400] "GET /pics/wpaper.gif HTTP/1.0" 200 6248 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:47-0400] "GET /asctortf/ HTTP/1.0" 200 8130 "http://search.netscape.com/computers/data_formats/document/text/rtf" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:48-0400] "GET /pics/5star2000.gif HTTP/1.0" 200 4005 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:50-0400] "GET /pics/5star.gif HTTP/1.0" 200 1031 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:51-0400] "GET /pics/a2hlogo.jpg HTTP/1.0" 200 4282 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:51-0400] "GET /cgi-bin/newcount?jafsof3&width=4&font=digital&noshow HTTP/1.0" 200 36 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
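A minimal Python sketch of how such a semi-structured log line can be turned into structured fields (the regular expression below assumes the combined log format shown above):

# Parse one web server log line (combined format) into named fields.
import re

LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

line = ('123.123.123.123 - - [26/Apr/2000:00:23:48-0400] '
        '"GET /pics/wpaper.gif HTTP/1.0" 200 6248 '
        '"http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"')

match = LOG_PATTERN.match(line)
if match:
    record = match.groupdict()
    print(record["host"], record["path"], record["status"], record["agent"])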

Potential Uses of Clickstream Data (source: http://hortonworks.com). One of the original uses of Hadoop at Yahoo was to store and process its massive volume of clickstream data. Now enterprises can use Hadoop-based platforms such as the Hortonworks Data Platform (HDP) to refine and analyze clickstream data. They can then answer business questions such as: What is the most efficient path for a site visitor to research a product and then buy it? What products do visitors tend to buy together, and what are they most likely to buy in the future? Where should I spend resources on fixing or enhancing the user experience on my website?

Variety (cont.) A/B testing (comparing two versions, A and B, that are identical except for one variation that might affect a user's behavior), sessionization (behavioral analytics that looks at how and why users behave by grouping events into sessions; a session ID is created and stored every time a user visits a web page or mobile application), bot detection (identifying traffic from infected, bot-controlled computers), and path analysis (modelling directed dependencies among a set of variables, as in multiple regression analysis) all require powerful analytics on many petabytes of semi-structured web data.
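Sessionization, for example, can be sketched in a few lines of Python: group each user's click events into sessions, starting a new session whenever the gap between two events exceeds a timeout (the events and the 30-minute timeout below are illustrative assumptions).

# Sessionization sketch: assign events to sessions using a 30-minute timeout.
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)

events = [  # (user, timestamp), assumed sorted by user and time
    ("alice", datetime(2015, 4, 14, 10, 0)),
    ("alice", datetime(2015, 4, 14, 10, 10)),
    ("alice", datetime(2015, 4, 14, 11, 5)),   # gap > 30 min: new session
    ("bob",   datetime(2015, 4, 14, 9, 30)),
]

sessions = {}    # (user, session number) -> list of timestamps
last_seen = {}   # user -> (previous timestamp, current session number)
for user, ts in events:
    previous = last_seen.get(user)
    if previous is None or ts - previous[0] > SESSION_TIMEOUT:
        session_no = 0 if previous is None else previous[1] + 1
    else:
        session_no = previous[1]
    sessions.setdefault((user, session_no), []).append(ts)
    last_seen[user] = (ts, session_no)

for (user, session_no), stamps in sessions.items():
    print(user, "session", session_no, "has", len(stamps), "event(s)")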

Twitter (source: "What Twitter's Made Of" by Paul Ford, Bloomberg Businessweek, November 11-17, 2013). A tweet is short: 140 characters. But if you open up a tweet and look inside (via an Application Programming Interface, API), you find, for example: the identity of the creator (bot or human), the location from which it originated, the date and time it went out, and the number of people who read the tweet, favourited it, retweeted it, and so on. We call this metadata. You can access this information by requesting an API key from Twitter (a fast, automated procedure); you get a web address and can access it as raw data for computers to read.

Twitter: examples of metadata. The coordinates part of a tweet contains geographical information: latitude and longitude, in a format called GeoJSON (a public open standard). The place part of a tweet describes specific, named locations (multiple coordinates forming polygons over the surface of the earth).

Twitter. With this metadata (places and times), by applying some math one can reveal, for example, how far one tweeter is from another, or learn when people are most active in social media engagement.
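The "some math" for the distance question is typically the haversine formula, which gives the great-circle distance between two latitude/longitude pairs such as the coordinates attached to two geotagged tweets. A small Python sketch (the coordinates are illustrative):

# Great-circle distance between two (latitude, longitude) points, in km.
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    earth_radius_km = 6371.0
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))

# Example: a tweet sent from Frankfurt am Main vs. one sent from Berlin
print(round(haversine_km(50.11, 8.68, 52.52, 13.40), 1), "km")   # roughly 420 km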

Twitter: more metadata.
- withheld_copyright: if set to true, there is trouble over copyright.
- withheld_in_countries: a list of countries in which the tweet is banned.
- possibly_sensitive: if set to true, the tweet links to potentially offensive things such as nudity, violence, or medical procedures (a user can check a box in his profile so that his tweets are automatically flagged).

Example of search indexes (source: https://cloudant.com/). Search indexes are defined by a JavaScript function. This is run over all of your documents, in a similar manner to a view's map function, and defines the fields that your search can query.
A simple search function:
function(doc) {
  index("name", doc.name);
}
Defining an analyzer:
"indexes": {
  "mysearch": {
    "analyzer": "whitespace",
    "index": "function(doc){ ... }"
  }
}

Sensor Data: Analyze Machine and Sensor Data (source: http://hortonworks.com/hadoop-tutorial/how-to-analyze-machine-and-sensor-data/). A sensor is a device that measures a physical quantity and transforms it into a digital signal. Sensors are always on, capturing data at a low cost and powering the Internet of Things. A key task with sensor data is separating signal from noise. Potential uses of sensor data: sensors can be used to collect data from many sources, for example to monitor machines or infrastructure such as ventilation equipment, bridges, energy meters, or airplane engines (this data can be used for predictive analytics, to repair or replace these items before they break), or to monitor natural phenomena such as meteorological patterns, underground pressure during oil extraction, or patient vital statistics during recovery from a medical procedure.

Raw data (source: http://www.wisegeek.org/what-is-raw-data.htm). Raw data, also known as source data or atomic data, is information that has not been processed in order to be displayed in any sort of presentable form. The raw form may look unrecognizable and be nearly meaningless without processing, but it may also be in a form that some can interpret, depending on the situation. This data can be processed manually or by a machine. In some cases, raw data may be nothing more than a series of numbers. The way those numbers are sequenced, however, and sometimes even the way they are spaced, can be very important information. A computer may interpret this information and give a readout that then makes sense to the reader. Binary code is a good example of raw data. Taken by itself as a printout, a binary code does very little for the computer user, at least for the vast majority of users. When it is processed through a computer, on the other hand, it provides more understandable information. In fact, binary code is typically the source code for everything a computer user sees.

Figure: sensor data logged to a text file and imported into Excel (source: Memos from the Cube).

Data Challenges (cont.) Velocity: reacting to the flood of information in the time required by the application. Stream computing, e.g. "Show me all people who are currently living in the Bay Area flood zone", continuously updated by GPS data in real time (IBM). "Challenge: the change of the data structure; the consumer no longer has control over the source of data creation; this requires the concept of late binding; it also poses a major challenge with regard to governance and data quality; with the shift of the transformation of data from ETL to at-time-of-consumption, the ETL knowledge must be given to every consumer; tools will have to help with that." (Thomas Fastner, eBay). A further challenge is combining multiple data sets.
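A toy Python sketch of the stream-computing example above: continuously filter a stream of GPS readings and maintain the set of people currently inside a (made-up) flood-zone bounding box. A real deployment would use a stream-processing engine rather than a Python loop.

# Toy stream filter: who is currently inside the flood-zone bounding box?
FLOOD_ZONE = (37.40, 37.80, -122.50, -122.10)   # hypothetical min/max lat, min/max lon

def in_flood_zone(lat, lon):
    min_lat, max_lat, min_lon, max_lon = FLOOD_ZONE
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon

def monitor(gps_stream):
    """Yield the current set of people in the zone after every GPS update."""
    inside = set()
    for person, lat, lon in gps_stream:
        if in_flood_zone(lat, lon):
            inside.add(person)
        else:
            inside.discard(person)
        yield set(inside)

stream = [("ann", 37.60, -122.30), ("bob", 37.00, -121.90), ("ann", 38.10, -122.30)]
for snapshot in monitor(stream):
    print(snapshot)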

Data Challenges (cont.) Personally identifiable information: much of this information is about people. "Can we extract enough information to help people without extracting so much as to compromise their privacy? Partly, this calls for effective industrial practices. Partly, it calls for effective oversight by government. Partly, perhaps mostly, it requires a realistic reconsideration of what privacy really means." (Paul Miller). The "right to be forgotten": about 1,000 people a day ask Google to remove search links (145,000 requests have been made in the European Union, covering 497,000+ web links).

Data Challenges (cont.) Data dogmatism: "Analysis of big data can offer quite remarkable insights, but we must be wary of becoming too beholden to the numbers. Domain experts and common sense must continue to play a role. For example, it would be worrying if the healthcare sector only responded to flu outbreaks when Google Flu Trends told them to." (Paul Miller)

Process Challenges. The challenges with deriving insight include:
- capturing data,
- aligning data from different sources (e.g., resolving when two objects are the same),
- transforming the data into a form suitable for analysis,
- modeling it, whether mathematically or through some form of simulation,
- understanding the output, visualizing and sharing the results.
(Laura Haas, IBM Research)

Management Challenges: data privacy, security, and governance. This means:
- ensuring that data is used correctly (abiding by its intended uses and relevant laws),
- tracking how the data is used, transformed, derived, etc.,
- and managing its lifecycle.
"Many data warehouses contain sensitive data such as personal data. There are legal and ethical concerns with accessing such data. So the data must be secured and access controlled, as well as logged for audits." (Michael Blaha)

Big Data: Data Platforms. "In the Big Data era the old paradigm of shipping data to the application isn't working any more. Rather, the application logic must come to the data, or else things will break: this is counter to conventional wisdom and the established notion of strata within the database stack." With Hadoop, processing moves to where the data is. Data management: "With terabytes, things are actually pretty simple; most conventional databases scale to terabytes these days. However, try to scale to petabytes and it's a whole different ball game." (Florian Waas, previously at Pivotal)

Big Data Analytics. In order to analyze Big Data, the current state of the art is a parallel database or a NoSQL data store, with a Hadoop connector. There are concerns about performance issues arising from the transfer of large amounts of data between the two systems. The use of connectors could introduce delays and data silos, and increase the total cost of ownership (TCO). And what about existing data warehouses?

Which Analytics Platform for Big Data?
- NoSQL (document stores, key-value stores, ...)
- NewSQL
- In-memory databases
- Hadoop
- Data warehouses
Plus scripts, workflows, and ETL-like data transformations. Are we going back to federated databases? This just seems like too many moving parts.

Figure source: http://blogs.teradata.com/data-points/tag/hadoop/page/2/

Figure source: http://vision.cloudera.com/cloudera-connect-the-blueprint-to-an-information-driven-enterprise/

High Performance, High Functionality Big Data Software Stack. Geoffrey Fox, Judy Qiu, Shantenu Jha, Indiana and Rutgers University. http://www.exascale.org/bdec/sites/www.exascale.org.bdec/files/whitepapers/fox.pdf

Figure source: http://www.analytics-tools.com/p/home.html

Build your own database. Spanner: Google's Globally-Distributed Database. Spanner is Google's scalable, multi-version, globally-distributed, and synchronously-replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions. "Spanner: Google's Globally-Distributed Database", published in the Proceedings of OSDI'12: Tenth Symposium on Operating System Design and Implementation, Hollywood, CA, October 2012. Recipient of the Jay Lepreau Best Paper Award.

Google AdWords Ecosystem. One shared database backing Google's core AdWords business. Legacy DB: sharded MySQL.
- Critical applications driving Google's core ad business
- 24/7 availability, even with data center outages
- Consistency required: can't afford to process inconsistent data; eventual consistency is too complex and painful
- Scale: 10s of TB, replicated to 1000s of machines
F1: a new database, built from scratch, designed to operate at Google scale without compromising on RDBMS features. Co-developed with a new lower-level storage system, Spanner.
- Better scalability
- Better availability
- Equivalent consistency guarantees
- Equally powerful SQL queries
Source: www.stanford.edu/class/cs347/slides/f1.pdf

Google F1: A Hybrid Database. F1 combines the scalability of Bigtable with the usability and functionality of SQL databases. Key ideas:
- Scalability: auto-sharded storage
- Availability and consistency: synchronous replication
- High commit latency: can be hidden
- Hierarchical schema
- Protocol Buffer column types
- Efficient client code
"A scalable database without going NoSQL." Reference: "F1 - The Fault-Tolerant Distributed RDBMS Supporting Google's Ad Business", Jeff Shute, Mircea Oancea, Stephan Ellner, Ben Handy, Eric Rollins, Bart Samwel, Radek Vingralek, Chad Whipkey, Xin Chen, Beat Jegerlehner, Kyle Littlefield, Phoenix Tong. SIGMOD, May 22, 2012.

Hadoop Limitations. Hadoop enables powerful analysis, but it is fundamentally a batch-oriented paradigm. The missing piece of the Hadoop puzzle is accounting for real-time changes. Apache Hadoop YARN (MapReduce 2.0, MRv2) is a sub-project of Hadoop at the Apache Software Foundation that takes Hadoop beyond batch to enable broader data processing.

Replacing Hadoop? Apache Spark is an open-source data analytics cluster computing framework originally developed in the AMPLab at UC Berkeley (https://spark.apache.org). Databricks was founded out of the UC Berkeley AMPLab by the creators of Apache Spark: a unified platform for building Big Data pipelines, from ETL to exploration and dashboards to advanced analytics and data products. The Stratosphere project (TU Berlin, Humboldt University, Hasso Plattner Institute, www.stratosphere.eu) contributes to Apache Flink, a platform for efficient, distributed, general-purpose data processing (flink.incubator.apache.org). The ASTERIX project (UC Irvine, started 2009, http://asterix.ics.uci.edu): four years of R&D involving researchers at UC Irvine, UC Riverside, and Oracle Labs. The AsterixDB code base currently consists of over 250K lines of Java code that has been co-developed by project staff and students at UCI and UCR, open source under an Apache-style licence.
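For a feel of the programming model behind these frameworks, here is a minimal word-count sketch in PySpark. It assumes a local Spark installation and uses a hypothetical input file path.

# Minimal PySpark word count (assumes Spark is installed; input path is hypothetical).
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount-sketch")

counts = (sc.textFile("data/sample.txt")            # lines, as a distributed dataset
            .flatMap(lambda line: line.split())     # words
            .map(lambda word: (word.lower(), 1))    # (word, 1) pairs
            .reduceByKey(lambda a, b: a + b))       # sum the counts per word

for word, count in counts.takeOrdered(10, key=lambda pair: -pair[1]):
    print(word, count)

sc.stop()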

Which Language for Analytics? There is a trend toward using SQL for analytics and for the integration of data stores (e.g. SQL-H, Teradata QueryGrid). Is this good?

Graphs and Big Data (sources: http://www.graphanalysis.org/sc12/02_feo.pdf and http://neo4j.com/developer/graph-database/). The breadth of problems requiring graph analytics is growing rapidly:
- Large network systems
- Social networks
- Packet inspection
- Natural language understanding
- Semantic search and knowledge discovery
- Cybersecurity

NoSQL graph database examples: Neo4j, InfiniteGraph, AllegroGraph. Data model: nodes and relationships (http://neo4j.com/developer/graph-database/).
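A toy illustration of the nodes-and-relationships model using plain Python structures, together with a simple traversal; a real graph database such as Neo4j stores this natively and answers such queries with its own query language. The data is invented.

# Property-graph sketch: nodes with properties, plus typed relationships.
nodes = {
    1: {"label": "Person",  "name": "Alice"},
    2: {"label": "Person",  "name": "Bob"},
    3: {"label": "Product", "name": "Laptop"},
}
relationships = [       # (from_node, type, to_node)
    (1, "KNOWS", 2),
    (1, "BOUGHT", 3),
    (2, "BOUGHT", 3),
]

def neighbours(node_id, rel_type):
    """Follow outgoing relationships of the given type from a node."""
    return [dst for src, typ, dst in relationships
            if src == node_id and typ == rel_type]

# A typical graph query: "What did the people Alice knows buy?"
for friend in neighbours(1, "KNOWS"):
    for product in neighbours(friend, "BOUGHT"):
        print(nodes[friend]["name"], "bought", nodes[product]["name"])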

Hadoop Benchmarks. The goal is to quantitatively evaluate and characterize a Hadoop deployment through benchmarking. HiBench: A Representative and Comprehensive Hadoop Benchmark Suite (Intel Asia-Pacific Research and Development Ltd). The HiBench suite is a benchmark suite for Hadoop consisting of a set of Hadoop programs, including both synthetic micro-benchmarks and real-world applications:
- Micro benchmarks: Sort, WordCount, TeraSort, EnhancedDFSIO
- Web search: Nutch Indexing, PageRank
- Machine learning: Bayesian Classification, K-means Clustering
- Analytical queries: Hive Join, Hive Aggregation

Big Data Benchmarks. The TPC launched TPCx-HS, the industry's first standard for benchmarking big data systems, designed to provide metrics and methodologies to enable fair comparisons of systems from various vendors (Raghunath Nambiar, Cisco, chairman of the TPC big data committee, August 18, 2014).

Big Data and the Cloud. What about traditional enterprises? Cloud adoption for analytics is still at a very early stage; in general, people are concerned with the protection and security of their data. Hadoop in the cloud: Amazon has a significant web services business around Hadoop.

Big Data for the Common Good. Very few people seem to look at how Big Data can be used for solving social problems; most of the work, in fact, is not in this direction. Why is this? A lack of obvious economic and personal incentives. What can be done in the international research and development communities to make sure that some of the most brilliant ideas also have an impact on social issues?

Big Data for the Common Good. "As more data becomes less costly and technology breaks barriers to acquisition and analysis, the opportunity to deliver actionable information for civic purposes grows. This might be termed the common good challenge for Big Data." (Jake Porway, DataKind)