Big Data: Introduction. Santiago González <sgonzalez@fi.upm.es>
Contents: Why Big Data?; Characteristics of Big Data; Big Data Technologies and Tools; Fundamental Big Data Paradigms; Data Mining; Visualization. SLIDE 1
Why Big Data? We are drowning in data but starving for knowledge! SLIDE 2
Why Big Data? The model of generating and consuming data has changed. Old model: a few companies generate data and all others consume it. New model: all of us generate data, and all of us consume it. SLIDE 3
Who generates and uses data? Mobile devices (tracking all objects all the time); social media and networks (all of us are generating data); scientific instruments (collecting all sorts of data); sensor technology and networks (measuring all kinds of data). Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion. SLIDE 4
Evolution. OLTP: Online Transaction Processing (DBMSs). OLAP: Online Analytical Processing (data warehousing). RTAP: Real-Time Analytics Processing (Big Data architecture and technology). SLIDE 5
Big Data. "Big data refers to the tools, processes and procedures allowing an organization to create, manipulate, and manage very large data sets and storage facilities" (zdnet.com). "The big deal about big data is the potential for getting more value more quickly from more data, at a lower cost and with greater agility" (Brian Hopkins, zdnet). SLIDE 6
Big Data. Big Data is data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it and to extract value and hidden knowledge from it. SLIDE 7
Characteristics of Big Data SLIDE 8
Characteristics of Big Data: Volume. Data volume is increasing exponentially: a 44x increase from 2009 to 2020, from 0.8 zettabytes to 35 zettabytes. This is an exponential increase in collected and generated data. SLIDE 9
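As a rough check of the figures on this slide, the sketch below computes the overall growth factor and the average yearly growth it would imply. The steady compound-growth model is an assumption made here for illustration; the slide only gives the two endpoints.

```python
# Figures from the slide: 0.8 ZB in 2009, 35 ZB projected for 2020.
start_zb = 0.8
end_zb = 35.0
years = 2020 - 2009

# Overall growth factor (the slide rounds this to 44x).
growth_factor = end_zb / start_zb

# Assumption: steady compound growth, which the slide does not claim.
annual_rate = growth_factor ** (1 / years) - 1

print(f"growth factor: {growth_factor:.2f}x")
print(f"implied annual growth: {annual_rate:.1%}")
```

Under that assumption the volume would grow by roughly 41% per year, that is, it would double about every two years.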
Characteristics of Big Data: Variety. Various formats, types, and structures: text, numerical, images, audio, video, sequences, time series, social media data, multi-dimensional arrays, etc. Static data vs. streaming data. A single application can generate or collect many types of data. To extract knowledge, all these types of data need to be linked together. SLIDE 10
Characteristics of Big Data: Velocity. Data is being generated fast and needs to be processed fast. Online data analytics: late decisions mean missed opportunities. Examples. E-promotions: based on your current location, your purchase history, and what you like, send promotions right now for the store next to you. Healthcare monitoring: sensors monitor your activities and body, and any abnormal measurement requires an immediate reaction. SLIDE 11
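The healthcare monitoring example can be sketched as a stream processor that inspects each reading as it arrives instead of waiting for a batch. Everything specific here is an assumption for illustration: the function name `flag_abnormal`, the window size, the three-standard-deviation rule, and the sample values. A real monitoring system would use clinically validated rules.

```python
from collections import deque
from statistics import mean, stdev


def flag_abnormal(readings, window=5, threshold=3.0):
    """Yield (index, value) for readings far from the recent baseline.

    Hypothetical rule: a reading is abnormal when it lies more than
    `threshold` standard deviations from the mean of the previous
    `window` normal readings.
    """
    recent = deque(maxlen=window)
    for i, value in enumerate(readings):
        if len(recent) == window:
            mu, sigma = mean(recent), stdev(recent)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield i, value
                continue  # keep the outlier out of the baseline window
        recent.append(value)


# Made-up heart-rate samples (beats per minute), consumed one at a time.
heart_rate = [72, 74, 73, 75, 74, 73, 140, 74, 72]
alerts = list(flag_abnormal(heart_rate))
print(alerts)
```

Because the function is a generator, an alert is produced as soon as the abnormal reading is seen, which is the point of the velocity requirement.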
Big Data: the 3 Vs SLIDE 12
Even 4 Vs! SLIDE 13
Big Data Bubble? Gartner VP says Big Data is falling into the Trough of Disillusionment, January 2013 (Gartner Hype Cycle 2013; KDnuggets). SLIDE 14
Challenges. The bottleneck is in technology: new architectures, algorithms, and techniques are needed. It is also in technical skills: experts in using the new technology and in dealing with big data are needed. SLIDE 15
Big Data Technologies and Tools SLIDE 16
Architecture SLIDE 18
Fundamental Paradigms: MapReduce SLIDE 19
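The slide only names the paradigm, so the following is a minimal single-machine sketch of the usual word-count illustration of MapReduce, not code from the lecture. The function names `map_phase`, `shuffle`, and `reduce_phase` and the input documents are hypothetical; a real framework would run the map and reduce steps in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


# Made-up input; on a cluster each document could live on a different node.
documents = ["big data needs new tools", "big data big value"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

The map and reduce functions never share state, which is what allows a framework to distribute them across many machines.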
Fundamental Paradigms: CAP Theorem SLIDE 20
Statistics, Business Intelligence, Data Mining, Knowledge Discovery in Data (KDD), Predictive Analytics, Business Analytics, Data Science, Data Analytics. Same core idea: finding useful patterns in data. Different emphasis. SLIDE 21
Data Mining SLIDE 22
Why? Lots of data is being collected and warehoused: web data and e-commerce, purchases at department and grocery stores, bank and credit card transactions. Computers have become cheaper and more powerful. Competitive pressure is strong: providing better, customized services gives an edge (e.g. in Customer Relationship Management). SLIDE 23
Why? Data is collected and stored at enormous speeds (GB/hour): remote sensors on a satellite, telescopes scanning the skies, microarrays generating gene expression data, scientific simulations generating terabytes of data. Traditional techniques are infeasible for raw data. Data mining may help scientists in classifying and segmenting data and in hypothesis formation. SLIDE 24
What is it? Non-trivial extraction of implicit, previously unknown, and potentially useful information from data. Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns. SLIDE 25
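One simple instance of "discovering meaningful patterns" is finding items that are frequently bought together. The sketch below counts co-occurring item pairs; the baskets and the `min_support` threshold are made up for illustration, and production tools use more scalable algorithms than this brute-force count.

```python
from collections import Counter
from itertools import combinations

# Made-up shopping baskets; each set is one transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

min_support = 3  # hypothetical threshold: pair must appear in >= 3 baskets

# Count every unordered pair of items that occurs together in a basket.
pair_counts = Counter(
    pair
    for basket in transactions
    for pair in combinations(sorted(basket), 2)
)

frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= min_support}
print(frequent_pairs)
```

The pairs that survive the threshold were not stated anywhere in the input; they are implicit in the data, which is the sense of "extraction" used in the definition above.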
Origins. Data mining draws ideas from machine learning/AI, pattern recognition, statistics, and database systems. Traditional techniques may be unsuitable due to the enormity of the data, its high dimensionality, and its heterogeneous, distributed nature. Diagram labels: Statistics/AI, Machine Learning/Pattern Recognition, Database Systems, Data Mining. SLIDE 26
CRISP-DM. Why should there be a standard process? The data mining process must be reliable and repeatable by people with little data mining background. SLIDE 27
CRISP-DM. Why should there be a standard process? It allows projects to be replicated, aids project planning and management, and allows the scalability of new algorithms. SLIDE 28
CRoss-Industry Standard Process for Data Mining. Colin Shearer, "The CRISP-DM Model: The New Blueprint for Data Mining", Journal of Data Warehousing, Volume 5, Number 4, pp. 13-22, 2000. SLIDE 29
CRISP-DM SLIDE 30
CRISP-DM phases
Business Understanding: understanding project objectives and requirements; defining the data mining problem.
Data Understanding: initial data collection and familiarization; identifying data quality problems.
Data Preparation: table, record, and attribute selection; data transformation and cleaning.
Modeling: selecting and applying modeling techniques; calibrating parameters.
Evaluation: evaluating whether business objectives and issues have been addressed.
Deployment: deploying the resulting model; implementing a repeatable data mining process.
SLIDE 31
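As a toy illustration of how the Data Preparation, Modeling, and Evaluation phases follow one another, the sketch below cleans a small data set, calibrates a one-parameter model, and evaluates it on held-out records. The data, the churn scenario, the fixed split, and the threshold rule are all hypothetical and chosen only to keep the example short.

```python
# Made-up records: (monthly_spend, churned); None marks a missing value.
raw = [(20, 1), (25, 1), (None, 0), (60, 0), (70, 0), (30, 1), (65, 0), (22, 1)]

# Data Preparation: select usable records (drop those with missing values).
clean = [(x, y) for x, y in raw if x is not None]

# Hold out the last records for evaluation (hypothetical fixed split).
train, holdout = clean[:5], clean[5:]


def accuracy(threshold, rows):
    """Share of rows where 'spend below threshold' matches the churn label."""
    hits = sum((x < threshold) == bool(y) for x, y in rows)
    return hits / len(rows)


# Modeling: calibrate the single parameter (the threshold) on training data.
candidates = sorted({x for x, _ in train})
best_threshold = max(candidates, key=lambda t: accuracy(t, train))

# Evaluation: measure the calibrated model on unseen records.
holdout_accuracy = accuracy(best_threshold, holdout)
print(best_threshold, holdout_accuracy)
```

Business Understanding, Data Understanding, and Deployment are not represented in code here, since they mostly involve people and processes rather than computation.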
CRISP-DM phases and tasks
Business Understanding: Determine Business Objectives; Assess Situation; Determine Data Mining Goals; Produce Project Plan.
Data Understanding: Collect Initial Data; Describe Data; Explore Data; Verify Data Quality.
Data Preparation: Select Data; Clean Data; Construct Data; Integrate Data; Format Data.
Modeling: Select Modeling Technique; Generate Test Design; Build Model; Assess Model.
Evaluation: Evaluate Results; Review Process; Determine Next Steps.
Deployment: Plan Deployment; Plan Monitoring & Maintenance; Produce Final Report; Review Project.
SLIDE 32
CRISP-DM: Business Understanding and Data Understanding SLIDE 33
CRISP-DM: Knowledge acquisition techniques. Turban, Aronson, and Liang, "Knowledge Acquisition, Representation, and Reasoning", Decision Support Systems and Intelligent Systems, 7th Edition, Prentice Hall, 2005. SLIDE 34
DM Tools. Open source: Weka, Orange, R-Project, KNIME. Commercial: SPSS Clementine, SAS Miner, Matlab. SLIDE 35
DM Tools: Weka 3.6. Java. Excellent library, average interface. http://www.cs.waikato.ac.nz/ml/weka/ SLIDE 36
DM Tools: Orange. C++ and Python. Average library, good interface. http://orange.biolab.si/ SLIDE 37
DM Tools: R-Project. Similar to Matlab and Maple. Powerful libraries, average interface. Too slow for file access! http://cran.es.r-project.org/ SLIDE 38
DM Tools: KNIME. Java. Includes Weka, Python, and R-Project. Powerful libraries, good interface. http://www.knime.org/download-desktop SLIDE 39
DM Tools: Let's install KNIME! SLIDE 40
Visualization SLIDE 41
Visualization SLIDE 42