Big Data. Introduction. Santiago González <sgonzalez@fi.upm.es>

Contents
- Why Big Data?
- Characteristics of Big Data
- Big Data Technologies and Tools
- Fundamental Big Data Paradigms
- Data Mining
- Visualization
SLIDE 1

Why Big Data? We are drowning in data but starving for knowledge! SLIDE 2

Why Big Data? The model of generating and consuming data has changed.
- Old model: a few companies generate data; all others consume it.
- New model: all of us generate data, and all of us consume it.
SLIDE 3

Who generates and uses data?
- Mobile devices (tracking all objects all the time)
- Social media and networks (all of us are generating data)
- Scientific instruments (collecting all sorts of data)
- Sensor technology and networks (measuring all kinds of data)
Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely and scalable manner.
SLIDE 4

Evolution
- OLTP: Online Transaction Processing (DBMSs)
- OLAP: Online Analytical Processing (Data Warehousing)
- RTAP: Real-Time Analytics Processing (Big Data architecture and technology)
SLIDE 5
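The difference between the first two stages can be illustrated with a minimal sketch. It uses Python's built-in `sqlite3` module; the `sales` table, its columns, and the helper names `record_sale` and `revenue_by_store` are hypothetical and chosen only for this example. RTAP is not covered here, since it would need a streaming engine rather than a single database.

```python
import sqlite3

# Hypothetical in-memory table, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, store TEXT, amount REAL)"
)

def record_sale(store, amount):
    """OLTP-style operation: a short write transaction touching one row."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO sales (store, amount) VALUES (?, ?)", (store, amount)
        )

def revenue_by_store():
    """OLAP-style operation: a read-only aggregate over the whole table."""
    rows = conn.execute(
        "SELECT store, SUM(amount) FROM sales GROUP BY store ORDER BY store"
    ).fetchall()
    return dict(rows)

for store, amount in [("north", 10.0), ("south", 5.0), ("north", 2.5)]:
    record_sale(store, amount)

print(revenue_by_store())
```

In practice the two workloads usually run on separate systems (an operational DBMS and a data warehouse); a single SQLite file is used here only to keep the sketch self-contained.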

Big Data
- Big data refers to the tools, processes and procedures allowing an organization to create, manipulate, and manage very large data sets and storage facilities (zdnet.com)
- The big deal about big data is the potential for getting more value more quickly from more data, at a lower cost and with greater agility. (Brian Hopkins, zdnet)
SLIDE 6

Big Data
Big Data is data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it and to extract value and hidden knowledge from it.
SLIDE 7

Characteristics of Big Data SLIDE 8

Characteristics of Big Data: Volume
- Data volume: 44x increase from 2009 to 2020, from 0.8 zettabytes to 35 zettabytes
- Data volume is increasing exponentially
- Exponential increase in collected/generated data
SLIDE 9
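The figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes constant compound growth over the 11 years from 2009 to 2020; that assumption is ours and is not stated on the slide.

```python
start_zb, end_zb = 0.8, 35.0   # zettabytes, as quoted on the slide
years = 2020 - 2009            # 11 years

growth_factor = end_zb / start_zb
# Assumption: the same growth rate applies every year (compound growth).
annual_rate = growth_factor ** (1 / years) - 1

print(f"growth factor: {growth_factor:.2f}x")
print(f"implied annual growth: {annual_rate:.1%}")
```

The ratio 35 / 0.8 is 43.75, which the slide rounds to 44x; under the constant-growth assumption this corresponds to roughly 41% more data each year.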

Characteristics of Big Data: Variety
- Various formats, types, and structures
- Text, numerical, images, audio, video, sequences, time series, social media data, multi-dimensional arrays, etc.
- Static data vs. streaming data
- A single application can be generating/collecting many types of data
- To extract knowledge, all these types of data need to be linked together
SLIDE 10

Characteristics of Big Data: Velocity
- Data is being generated fast and needs to be processed fast
- Online data analytics
- Late decisions mean missed opportunities
Examples
- E-promotions: based on your current location, your purchase history, and what you like, send promotions right now for the store next to you
- Healthcare monitoring: sensors monitor your activities and body; any abnormal measurement requires an immediate reaction
SLIDE 11
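The healthcare example can be sketched as a check that runs on each reading as it arrives, rather than on a stored batch. The function name `abnormal_readings`, the heart-rate values, and the 40 to 120 bpm limits are hypothetical and serve only as an illustration.

```python
def abnormal_readings(readings, low=40, high=120):
    """Yield (index, value) as soon as a reading falls outside [low, high].

    `readings` can be any iterable, including an unbounded stream, because
    each value is checked when it arrives and nothing is buffered.
    """
    for i, bpm in enumerate(readings):
        if bpm < low or bpm > high:
            yield i, bpm

# Hypothetical heart-rate stream in beats per minute.
stream = iter([72, 75, 130, 71, 35])
alerts = list(abnormal_readings(stream))
print(alerts)  # [(2, 130), (4, 35)]
```

A real monitoring system would react inside the loop (for example by sending a notification) instead of collecting the alerts into a list.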

Big Data: the 3 Vs SLIDE 12

Even 4 Vs! SLIDE 13

Big Data Bubble?
- Gartner VP says Big Data is falling into the Trough of Disillusionment (Jan 2013)
- Gartner Hype Cycle 2013 (KDnuggets)
SLIDE 14

Challenges
- The bottleneck is in technology: new architectures, algorithms, and techniques are needed
- It is also in technical skills: experts in using the new technology and dealing with big data are needed
SLIDE 15

Big Data Technologies and Tools SLIDE 16

Architecture SLIDE 18

Fundamental paradigms: MapReduce SLIDE 19
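The slide content for MapReduce did not survive in this transcript, so the following is only a minimal single-process sketch of the general idea, using the classic word-count example. The function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical; a real framework such as Hadoop distributes these steps across many machines.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)

def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data big value", "data mining"])
print(counts)  # {'big': 2, 'data': 2, 'value': 1, 'mining': 1}
```

Because each call to `map_phase` looks at one document and each call to `reduce_phase` looks at one key, both steps can be run in parallel, which is what makes the paradigm suitable for very large data sets.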

Fundamental paradigms: CAP theorem SLIDE 20

Statistics, Business Intelligence, Data Mining, Knowledge Discovery in Data (KDD), Predictive Analytics, Business Analytics, Data Science, Data Analytics
- Same core idea: finding useful patterns in data
- Different emphasis
SLIDE 21

Data Mining SLIDE 22

Why?
- Lots of data is being collected and warehoused
  - Web data, e-commerce
  - Purchases at department/grocery stores
  - Bank/credit card transactions
- Computers have become cheaper and more powerful
- Competitive pressure is strong
  - Provide better, customized services for an edge (e.g. in Customer Relationship Management)
SLIDE 23

Why?
- Data is collected and stored at enormous speeds (GB/hour)
  - Remote sensors on a satellite
  - Telescopes scanning the skies
  - Microarrays generating gene expression data
  - Scientific simulations generating terabytes of data
- Traditional techniques are infeasible for raw data
- Data mining may help scientists
  - in classifying and segmenting data
  - in hypothesis formation
SLIDE 24

What is it?
- Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
- Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
SLIDE 25

Origins
- Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
- Traditional techniques may be unsuitable due to
  - Enormity of data
  - High dimensionality of data
  - Heterogeneous, distributed nature of data
[Diagram: Data Mining at the intersection of Statistics/AI, Machine Learning/Pattern Recognition, and Database Systems]
SLIDE 26

CRISP-DM: Why should there be a standard process?
The data mining process must be reliable and repeatable by people with little data mining background.
SLIDE 27

CRISP-DM: Why should there be a standard process?
- Allows projects to be replicated
- Aid to project planning and management
- Allows the scalability of new algorithms
SLIDE 28

CRoss-Industry Standard Process for Data Mining (CRISP-DM)
Colin Shearer, "The CRISP-DM Model: The New Blueprint for Data Mining", Journal of Data Warehousing, Volume 5, Number 4, pp. 13-22, 2000
SLIDE 29

CRISP-DM SLIDE 30

CRISP-DM
- Business Understanding: understanding project objectives and requirements; data mining problem definition
- Data Understanding: initial data collection and familiarization; identification of data quality problems
- Data Preparation: table, record, and attribute selection; data transformation and cleaning
- Modeling: selection and application of modeling techniques; parameter calibration
- Evaluation: evaluation of how well business objectives and issues are met
- Deployment: deployment of the resulting model; implementation of a repeatable data mining process
SLIDE 31
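The six phases listed above can be written down as an ordered structure, for instance to track where a project currently stands. This is only an illustrative sketch: the names `Phase` and `next_phase` are hypothetical, and the strictly linear order is a simplification made for the example.

```python
from enum import Enum

class Phase(Enum):
    """The six CRISP-DM phases, in the order listed on the slide."""
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6

def next_phase(phase):
    """Return the phase that follows `phase`, or None after deployment."""
    members = list(Phase)  # Enum members iterate in definition order
    idx = members.index(phase)
    return members[idx + 1] if idx + 1 < len(members) else None

print(next_phase(Phase.DATA_PREPARATION).name)  # MODELING
```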

CRISP-DM
- Business Understanding: Determine Business Objectives, Assess Situation, Determine Data Mining Goals, Produce Project Plan
- Data Understanding: Collect Initial Data, Describe Data, Explore Data, Verify Data Quality
- Data Preparation: Select Data, Clean Data, Construct Data, Integrate Data, Format Data
- Modeling: Select Modeling Technique, Generate Test Design, Build Model, Assess Model
- Evaluation: Evaluate Results, Review Process, Determine Next Steps
- Deployment: Plan Deployment, Plan Monitoring & Maintenance, Produce Final Report, Review Project
SLIDE 32

CRISP-DM: Business Understanding and Data Understanding SLIDE 33

CRISP-DM: Knowledge acquisition techniques
"Knowledge Acquisition, Representation, and Reasoning", Turban, Aronson, and Liang, Decision Support Systems and Intelligent Systems, 7th Edition, Prentice Hall, 2005
SLIDE 34

DM Tools
- Open source: Weka, Orange, R-Project, KNIME
- Commercial: SPSS Clementine, SAS Miner, Matlab
SLIDE 35

DM Tools: Weka 3.6
- Java
- Excellent library, average interface
- http://www.cs.waikato.ac.nz/ml/weka/
SLIDE 36

DM Tools: Orange
- C++ and Python
- Average library, good interface
- http://orange.biolab.si/
SLIDE 37

DM Tools: R-Project
- Similar to Matlab and Maple
- Powerful libraries, average interface
- Too slow for file access!
- http://cran.es.r-project.org/
SLIDE 38

DM Tools: KNIME
- Java
- Includes Weka, Python, and R-Project
- Powerful libraries, good interface
- http://www.knime.org/download-desktop
SLIDE 39

DM Tools: Let's install KNIME! SLIDE 40

Visualization SLIDE 41

Visualization SLIDE 42
