Technologies and algorithms to

Similar documents

A tour on big data classification: Learning algorithms, Feature selection, and Imbalanced Classes

Unified Big Data Processing with Apache Spark. Matei

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Brave New World: Hadoop vs. Spark

Big Data Analytics. Lucas Rego Drumond

CSE-E5430 Scalable Cloud Computing Lecture 2

COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Big Data by the numbers

Application Development. A Paradigm Shift

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

Large-Scale Data Processing

Hadoop Ecosystem B Y R A H I M A.

Spark and the Big Data Library

Big Data Analytics. Prof. Dr. Lars Schmidt-Thieme

COMP9321 Web Application Engineering

Machine Learning over Big Data

Big Data Analytics Hadoop and Spark

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

MapReduce and Hadoop Distributed File System

Big Data: Opportunities & Challenges, Myths & Truths 資料來源 : 台大廖世偉教授課程資料

Ali Ghodsi Head of PM and Engineering Databricks

BIG DATA What it is and how to use?

Big Data With Hadoop

Big Data Explained. An introduction to Big Data Science.

Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

MapReduce and Hadoop Distributed File System V I J A Y R A O

Let the data speak to you. Look Who s Peeking at Your Paycheck. Big Data. What is Big Data? The Artemis project: Saving preemies using Big Data

HPC ABDS: The Case for an Integrating Apache Big Data Stack

Big Data Processing. Patrick Wendell Databricks

SEIZE THE DATA SEIZE THE DATA. 2015

Big Data and Analytics: Challenges and Opportunities

Scaling Out With Apache Spark. DTL Meeting Slides based on

ISSN: CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

International Journal of Advanced Engineering Research and Applications (IJAERA) ISSN: Vol. 1, Issue 6, October Big Data and Hadoop

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Hadoop Parallel Data Processing

BITKOM& NIK - Big Data Wo liegen die Chancen für den Mittelstand?

Big Data. Lyle Ungar, University of Pennsylvania

Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan

How To Create A Data Visualization With Apache Spark And Zeppelin

Data-Intensive Computing with Map-Reduce and Hadoop

Outline. What is Big data and where they come from? How we deal with Big data?

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

Information Management course

The Stratosphere Big Data Analytics Platform

BIG DATA TRENDS AND TECHNOLOGIES

Fast and Expressive Big Data Analytics with Python. Matei Zaharia UC BERKELEY

Adapting scientific computing problems to cloud computing frameworks Ph.D. Thesis. Pelle Jakovits

Large scale processing using Hadoop. Ján Vaňo

Big Data and Scripting Systems beyond Hadoop

Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter

Big Data a threat or a chance?

Streaming items through a cluster with Spark Streaming

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

Architectures for Big Data Analytics A database perspective

Software tools for Complex Networks Analysis. Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team

Challenges for Data Driven Systems

Fast Analytics on Big Data with H20

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics

Bringing Big Data Modelling into the Hands of Domain Experts

Hadoop implementation of MapReduce computational model. Ján Vaňo

Big Data Challenges in Bioinformatics

Spark. Fast, Interactive, Language- Integrated Cluster Computing

International Journal of Engineering Research ISSN: & Management Technology November-2015 Volume 2, Issue-6

Moving From Hadoop to Spark

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Is a Data Scientist the New Quant? Stuart Kozola MathWorks

How Companies are! Using Spark

E6895 Advanced Big Data Analytics Lecture 3:! Spark and Data Analytics

Dell In-Memory Appliance for Cloudera Enterprise

Advanced Big Data Analytics with R and Hadoop

Convex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics

Trends and Research Opportunities in Spatial Big Data Analytics and Cloud Computing NCSU GeoSpatial Forum

Bayesian networks - Time-series models - Apache Spark & Scala

Getting to Know Big Data

Hadoop and Map-Reduce. Swati Gore

Are You Ready for Big Data?

What is Big Data? Concepts, Ideas and Principles. Hitesh Dharamdasani

Spark: Making Big Data Interactive & Real-Time

NoSQL and Hadoop Technologies On Oracle Cloud

Chapter 7. Using Hadoop Cluster and MapReduce

Big Data and Apache Hadoop s MapReduce

Big Data at Spotify. Anders Arpteg, Ph D Analytics Machine Learning, Spotify

Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment

BIG DATA & ANALYTICS. Transforming the business and driving revenue through big data and analytics

A Performance Analysis of Distributed Indexing using Terrier

An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework

Sunnie Chung. Cleveland State University

Open source Google-style large scale data analysis with Hadoop

From GWS to MapReduce: Google s Cloud Technology in the Early Days

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

Big Data Analysis: Apache Storm Perspective

Beyond Hadoop with Apache Spark and BDAS

Big Data Systems CS 5965/6965 FALL 2015

Transcription:

Big Data: Technologies and algorithms to deal with challenges Francisco Herrera Research Group on Soft Computing and Information Intelligent Systems (SCI 2 S) Dept. of Computer Science and A.I. University of Granada, Spain Email: herrera@decsai.ugr.es http://sci2s.ugr.es

Big Data 2

Outline Big Data. Big Data Science Why Big Data? Technologies: MapReduce Paradigm. Hadoop Ecosystem Big Data Analytics. Algorithms Real Cases Big Data at SCI 2 S Final Comments. Challenges

Outline Big Data. Big Data Science Why Big Data? Technologies: MapReduce Paradigm. Hadoop Ecosystem Big Data Analytics. Algorithms Real Cases Big Data at SCI 2 S Final Comments. Challenges

What is Big Data? 3 Vs of Big Data volume Ej. Genomics Astronomy Transactions Gartner, Doug Laney, 2001

What is Big Data? 3 Vs of Big Data velocity Gartner, Doug Laney, 2001

What is Big Data? 3 Vs of Big Data variety Gartner, Doug Laney, 2001

What is Big Data? 3 Vs of Big Data Some Make it 4V s: Veracity

What is Big Data?

What is Big Data? No single standard definition Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. Big Data is data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it

What is Big Data? (in short) Big data refers to any problem characteristic that represents a challenge to process it with traditional applications

(Big) Data Science Data Science combines the traditional scientific method with the ability to explore, learn and gain deep insighti for (Big) Data It is not just about finding patterns in data it is mainly about explaining those patterns

Data Science Process Data Prepr rocess sing Clean Sample Aggregate Imperfect data: missing, noise, Reduce dim.... > 70% time! rocess sing Data P Explore data Represent data Link data Learn from data Deliver insight Analy ytics Data Clustering Classification Regressionession Network analysis Visual analytics Association

Outline Big Data. Big Data Science Why Big Data? Technologies: MapReduce Paradigm. Hadoop Ecosystem Big Data Analytics. Algorithms Real Cases Big Data at SCI 2 S Final Comments. Challenges

Why Big Data? Scalability to large data volumes: Scan 100 TB on 1 node @ 50 MB/sec = 23 days Scan on 1000-node cluster = 33 minutes Divide-And-Conquer (i.e., data partitioning) A i l hi t l l f d t A single machine can not manage large volumes of data efficiently

Why Big Data? Scalability to large data volumes: Scan 100 TB on 1 node @ 50 MB/sec = 23 days Scan on 1000-node cluster = 33 minutes Divide-And-Conquer (i.e., data partitioning) How must we process 1000 TB or 10000 TB?

Why Big Data? a MapReduce Scalability to large data volumes: Scan 100 TB on 1 node @ 50 MB/sec = 23 days Scan on 1000-node cluster = 33 minutes Divide-And-Conquer (i.e., data partitioning) MapReduce Overview: Data-parallel programming model An associated parallel and distributed implementation for commodity clusters Pioneered by Google Processes 20 PB of data per day Popularized by open-source Hadoop project Used by Yahoo!, Facebook, Amazon, and the list is growing

MapReduce MapReduce is a popular approach to deal with Big Data Data Fragmentation with Parallel Processing + Model fusion

MapReduce MapReduce workflow Blocks/ Intermediary Output fragments Files Files The key of a MapReduce data fragmentation approach is usually on the reduce phase J. Dean, S. Ghemawat, MapReduce: Simplified data processing on large clusters, Communications of the ACM 51 (1) (2008) 107-113.

MapReduce Experience Runs on large commodity clusters: 10s to 10,000s of machines Processes many terabytes of data Easy to use since run-time complexity hidden from the users Cost-efficiency: Commodity nodes (cheap, but unreliable) Commodity network Automatic fault-tolerance (fewer administrators) Easy to use (fewer programmers)

MapReduce. Hadoop Hadoop is an open source implementation of MapReduce computational paradigm Created by Doug Cutting (chairman of board of directors of the Apache Software Foundation, 2010) http://hadoop.apache.org/

MapReduce. Hadoop birth July 2008 - Hadoop Wins Terabyte Sort Benchmark One of Yahoo's Hadoop clusters sorted 1 terabyte of data in 209 seconds, which beat the previous record of 297 seconds in the annual al general purpose pose (Daytona) terabyte short bechmark. This is the first time that either a Java or an open source program has won. http://developer.yahoo.com/blogs/hadoop/hadoopsorts-petabyte-16-25-hours-terabyte-62-422.html

MapReduce: Limitations If all you have is a hammer, then everything looks like a nail. The following types of algorithms are examples where MapReduce: Pregel (Google) Iterative Graph Algorithms: PageRank Gradient Descent Expectation Maximization

MapReduce: Limitations More than 10000 applications in Google Enrique Alfonseca Google Research Zurich 24

Hadoop On the limitations of Hadoop. New platforms GIRAPH (APACHE Project) (http://giraph.apache.org/) Procesamiento iterativo de grafos Twister (Indiana University) http://www.iterativemapreduce.org/ Clusters propios GPS - A Graph Processing System, (Stanford) http://infolab.stanford.edu/gps/ para Amazon's EC2 Distributed GraphLab (Carnegie Mellon Univ.) https://github.com/graphlab-code/graphlab Amazon's EC2 PrIter (University of Massachusetts Amherst, Northeastern University-China) http://code.google.com/p/priter/ Cluster propios y Amazon EC2 cloud HaLoop (University of Washington) http://clue.cs.washington.edu/node/14 cs edu/node/14 http://code.google.com/p/haloop/ Amazon s EC2 Spark (UC Berkeley) (100 times more efficient i than GPU based platforms Mars Grex Hadoop, including iterative algorithms, according to creators) http://spark.incubator.apache.org/research.html

Hadoop Ecosystem The project GIRAPH (APACHE Project) (http://giraph.apache.org/) Iterative Graphs Recently: Apache Spark http://hadoop.apache.org/ Spark (UC Berkeley) (100 times more efficient than Hadoop, including iterative algorithms, according to creators)

Spark Hadoop: In Memory oy Hadoop Evolution InMemory HDFS Hadoop + SPARK Ecosystem Apache Spark Future version of Mahout for Spark https://cwiki.apache.org/confluence/display/spark/powered+by+spark Databricks, Groupon, ebay inc., Amazon, Hitachi, Nokia, Yahoo!,

Spark Hadoop. Spark birth October 10, 2014 Using Spark on 206 EC2 nodes, we completed the benchmark in 23 minutes. This means that t Spark sorted the same data 3X faster using 10X fewer machines. All the sorting took place on disk (HDFS), without using Spark s inmemory cache. http://databricks.com/blog/2014/10/10/spark-petabyte-sort.html

Spark Hadoop. Spark birth http://sortbenchmark.org/

Why Big Data? MapReduce Paradigm. Hadoop Ecosystem (A Snapshot) Big Data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making. There are two main issues related to Big Data: 1. Database/storage t frameworks: to write, read, and manage data. 2. Computational models: to process and analyze data. Recently, there are the following Big Data frameworks: 1. Storage frameworks: Google File System (GFS), Hadoop Distributed File Systems (HDFS). 2. Computational models: MapReduce (Apache Hadoop), Resilient Distributed Datasets (RDD by Apache Spark).

Outline Big Data. Big Data Science Why Big Data? Technologies: MapReduce Paradigm. Hadoop Ecosystem Big Data Analytics. Algorithms Real Cases Big Data at SCI 2 S Final Comments. Challenges

Big Data Analytics Potential scenarios: Clustering Classification Real Time Analytics/ Big Data Streams Association Recommendation Systems Social Media Mining Social Big Data

Big Data Analytics: Tools Generation 1st 2nd Generation 3nd Generation Generation Examples SAS, R, Weka, Mahout, Pentaho, Spark, Haloop, GraphLab, SPSS, KEEL Cascading Pregel, Giraph, ML over Storm Scalability Vertical Horizontal (over Horizontal (Beyond Hadoop) Hadoop) Algorithms Available Algorithms Not Available Fault- Tolerance Huge collection of algorithms Practically nothing Single point of failure Small subset: sequential logistic regression, linear SVMs, Stochastic Gradient Decendent, k- means clustsering, Random forest, etc. Vast no.: Kernel SVMs, Multivariate Logistic Regression, Conjugate Gradient Descendent, ALS, etc. Most tools are FT, as they are built on top of Hadoop Much wider: CGD, ALS, collaborative filtering, kernel SVM, matrix factorization, Gibbs sampling, etc. Multivariate logistic regression in general form, k-means clustering, etc. Work in progress to expand the set of available algorithms FT: HaLoop, Spark Not FT: Pregel, GraphLab, Giraph

Big Data Analytics: Tools Mahout MLlib https://spark.apache.org/mllib/ Version 1.4.1

Big Data Analytics: Tools MLlib https://spark.apache.org/mllib/

Big Data Analytics Case of Study: Random Forest for KddCup 99 The RF Mahout Partial implementation: is an algorithm that builds multiple trees for different portions of the data. Two phases: Building phase Classification phase 36

Big Data Analytics Case of Study: Random Forest for KddCup 99 Class Instance Number normal 972.781 DOS 3.883.370 PRB 41.102102 R2L 1.126 U2R 52

Big Data Analytics Case of Study: Random Forest for KddCup 99 Class Instance Number normal 972.781 DOS 3.883.370 PRB 41.102102 R2L 1.126 U2R 52 Cluster ATLAS: 16 nodes -Microprocessors: p 2 x Intel E5-2620 (6 cores/12 threads, 2 GHz) - RAM 64 GB DDR3 ECC 1600MHz -Mahout version 0.8

Big Data Analytics:Two last comments Image Credit: Shutterstock Without Analytics, Big Data is Just Noise Guest post by Eric Schwartzman, founder and CEO of Comply Socially http://www.briansolis.com/2013/04/without-analytics-big-data-is-just-noise/

Big Data Analytics:Two last comments Mahout Lack of big data preprocessing algorithms MLlib https://spark.apache.org/mllib/ Version 1.4.1

Big Data Analytics Data Preprocessing Big Data Data Science Model building Predictive and descriptive Analytics Lack of big data preprocessing algorithms

Outline Big Data. Big Data Science Why Big Data? Technologies: MapReduce Paradigm. Hadoop Ecosystem Big Data Analytics. Algorithms Real Cases. Health, Social Media and People shopping mall identification Big Data at SCI 2 S Final Comments. Challenges

Real Cases. Health HEALTHCARE ORGANIZATIONS HAVE COLLECTED EXABYTES OF DATA LAST YEARS Big data can facilitate action on the modifiable risk factors that contribute to a large fraction of the chronic disease burden, such as physical activity, diet, tobacco use, and exposure to pollution. Big Data, Vol.1, 3, 2013

Real Cases. Health and Social Media. Detect pandemic risk in real time Google Flu Trends (2008) (http://www.google.org/flutrends)) is a web service operated by Google. They could say, as the prevention and control centers, where the flu had spread but almost in real time, not one or two weeks. It provides estimates of influenza activity for more than 25 countries. ti By aggregating Google search queries (45 different terms) J. Ginsberg, M.H. Mohebbi, R.S. Patel, L. Brammer, M.S. Smolinski, L. Brilliant. Detecting influenza epidemics using search engine query data. Nature 475 (2009) 1012-1014

Real Cases. Health and Social Media. Detect pandemic risk in real time Google Flu J. Ginsberg, M.H. Mohebbi, R.S. Patel, L. Brammer, M.S. Smolinski, L. Brilliant. Detecting influenza epidemics using search engine query data. Nature 475 (2009) 1012-1014

Real Cases. Health and Social Media. Detect pandemic risk in real time Google Flu In 2013 he overestimated the levels of influenza (prevention and control centers estimate x2). The overestimation may be due to extensive media coverage of flu can modify behaviors search www.sciencemag.org SCIENCE VOL 343 14 MARCH 2014

Real Cases. Health and Social Media. Detect pandemic risk in real time Google Flu The models are updated annually 2014 model update applied to prior years The model has been extended to Dengue

Real Cases. Health and Social Media. Detect pandemic risk in real time Wikipedia is better than Google at tracking flu trends? Wikipedia traffic could be used to provide real time tracking of flu cases, according to the study published by John Brownstein. Wikipedia Usage Estimates Prevalence of Influenza-Like Illness in the United States in Near Real-Time David J. McIver, John S. Brownstein, Harvard Medical School, Boston Children s Hospital, Plos Computational Biology, 2014.

Real Cases. Health and Social Media You Are What You Tweet: Analyzing Twitter for Public Health Discovering Health Topics in Social Media Using Topic Models Michael J. Paul, Mark Dredze, Johns Hopkins University, Plos One, 2014 Analyzing user messages in social media can measure different population characteristics, including public health measures. A system filtering Twitter data can automatically infer health topics in 144 million Twitter messages from 2011 to 2013. ATAM discovered 13 coherent clusters of Twitter messages, some of which correlate with seasonal influenza (r = 0.689) and allergies (r = 0.810) temporal surveillance data, as well as exercise (r =.534) and obesity (r =2.631) related geographic survey data in the United States. 49

Real Cases. Shopping mall identification. 3 MONTHS OF CREDIT CARD RECORDS 11 MILLIONS OF PEOPLE http://www.sciencemag.org/content/347/6221/536 www.sciencemag.org SCIENCE VOL 347 January 30, 2015, pp. 536-539

Outline Big Data. Big Data Science Why Big Data? Technologies: MapReduce Paradigm. Hadoop Ecosystem Big Data Analytics. Algorithms Real Cases Big Data at SCI 2 S Final Comments. Challenges

Big Data at SCI 2 S - UGR Bird's eye view SCI 2 S website http://sci2s.ugr.es/bigdata

Big Data at SCI 2 S - UGR http://sci2s.ugr.es/bigdata Our approaches: Big Data with Fuzzy Models Fuzzy Rule Based System for classification Fuzzy Rule Based System with cost sensitive for imbalanced data sets https://github.com/saradelrio

Big Data at SCI 2 S - UGR http://sci2s.ugr.es/bigdata Our approaches: Imbalanced Big Data Preprocessing: Random undersampling, oversampling Cost sensitive https://github.com/saradelrio Bioinformatic Applications

Big Data at SCI 2 S - UGR http://sci2s.ugr.es/bigdata Our approaches: Big Data Preprocessing Evolutionary data reduction https://github.com/triguero Feature selection and discretization https://github.com/sramirez

ECBDL 14 Big Data Competition Vancouver, 2014 ECBDL 14 Big Data Competition 2014: Self-deployment track Objective: Contact map prediction Details: 32 million instances 631 attributes (539 real & 92 nominal values) 2 classes 98% of negative examples About 56.7GB of disk space Evaluation: True positive rate True negative rate TPR TNR http://cruncher.ncl.ac.uk/bdcomp/index.pl?action=data J. Bacardit et al, Contact map prediction using a large-scale ensemble of rule sets and the fusion of multiple predicted structural features, Bioinformatics 28 (19) (2012) 2441-2448

Evolutionary Computation for Big Data and Big Learning Workshop ECBDL 14 Big Data Competition 2014: Self-deployment track The challenge: Very large size of the training set Does not fit all together in memory. Even large for the test set (5.1GB, 2.9 million instances) Relatively high dimensional data. Low ratio (<2%) of true contacts. t Imbalance rate: >49 Imbalanced problem! Imbalanced Big Data Classification --- - - - - - - - - - - - - -- + + + + +

Imbalanced Big Data Classification A MapReduce Approach 32 million instances, 98% of negative examples Low ratio of true contacts (<2%). Imbalance rate: > 49. Imbalanced problem! Previous study on extremely imbalanced big data: S. Río, V. López, J.M. Benítez, F. Herrera, On the use of MapReduce for Imbalanced Big Data using Random Forest. Information Sciences 285 (2014) 112-137. Over-Sampling Random Focused Under-Sampling Random Focused Cost Modifying i (cost-sensitive) Boosting/Bagging approaches (with preprocessing)

ECBDL 14 Big Data Competition Our approach: ECBDL 14 Big Data Competition 2014 1. Balance the original training data Random Oversampling (As first idea, it was extended) 2 Learning a model 2. Learning a model. Random Forest

ECBDL 14 Big Data Competition Oversampling rate: {100%} RandomForest

ECBDL 14 Big Data Competition

ECBDL 14 Big Data Competition Our approach: ECBDL 14 Big Data Competition 2014 1. Balance the original training data Random Oversampling incresing the ROS percentage (As first idea, it was extended) 2. Learning a model. Random Forest 3. Detect relevant features. es 1. Evolutionary Feature Weighting Classifying test set.

ECBDL 14 Big Data Competition How to increase the performance? Third component: MapReduce Approach for Feature Weighting for getting a major performance over classes

ECBDL 14 Big Data Competition Last decision: We investigated to increase ROS until 180% with 64 mappers 64 mappers Algorithms TNR*TPR Training TPR TNR TNR*TPR Test ROS+ RF (130%+ FW 90+25f+200t) 0.736987 0.671279 0.783911 0.526223 ROS+ RF (140%+ FW 90+25f+200t) 0.717048 0.695109 0.763951 0.531029 ROS+ RF (150%+ FW 90+25f+200t) 0.706934 0.705882 0.753625 0.531971 ROS+ RF (160%+ FW 90+25f+200t) 0,698769 0.718692 0.741976 0.533252 ROS+ RF (170%+ FW 90+25f+200t) 0.682910 0.730432 0.730183 0.533349 ROS+ RF (180%+ FW 90+25f+200t) 0,678986 0.737381 0.722583 0.532819 To increase ROS and reduce the mappers number lead us to get a trade-off with good results ROS 170 85 replications of the minority instances

ECBDL 14 Big Data Competition Evolutionary Computation for Big Data and Big Learning Workshop Results of the competition: Contact map prediction TPR Team Name TPR TNR Acc TNR Efdamis 0.730432 0.730183 0.730188 0.533349 ICOS 0.703210 0.730155 0.729703 0.513452 UNSW 0.699159 0.727631 0.727153 0.508730 HyperEns 0.640027 0.763378 0.761308 0.488583 PUC-Rio_ICA 0.657092 0.714599 0.713634 0.469558 Test2 0.632009 0.735545 0.733808 0.464871 EmeraldLogic 0.686926 0.669737 0.670025 0.460059 EFDAMIS team ranked first in the ECBDL 14 big data competition LidiaGroup 0.653042 0.695753 0.695036 0.454356 http://cruncher.ncl.ac.uk/bdcomp/index.pl?action=ranking

ECBDL 14 Big Data Competition Our algorithm: ROSEFW-RF MapReduce Process Iterative MapReduce Process I. Triguero, S. del Río, V. López, J. Bacardit, J.M. Benítez, F. Herrera. ROSEFW-RF: The winner algorithm for the ECBDL'14 Big Data Competition: An extremely imbalanced big data bioinformatics problem imbalanced big data bioinformatics problem. Knowledge-Based Systems, Volume 87, October 2015, Pages 69 79 http://www.sciencedirect.com/science/article/pii/s0950705115002166 https://github.com/triguero/rosefw-rf

Outline Big Data. Big Data Science Why Big Data? Technologies: MapReduce Paradigm. Hadoop Ecosystem Big Data Analytics. Algorithms Real Cases Big Data at SCI 2 S Final Comments. Challenges

Final Comments Parallelization of machine learning algorithms with datapartitioning approaches can be successfully performed with MapReduce. Partitioning and applying the machine learning algorithm to each part. Focus on the combination phase (reduce). The combination of models is the challenge for each algorithm. Data Mining, Machine learning and data preprocessing: Huge collection of algorithms against big data analytics algorithms

Final Comments Data Mining, i Machine learning and data preprocessing: Huge collection of algorithms Big Data Analytics Big Data: A small subset of algorithms Big Data Preprocessing: A few methods for preprocessing in Big Data analytics. Deep learning: Neural networks based approach for signal/image processing in big data.

Final Comments Big data and analytics: a large challenge offering great opportunities A small subset of algorithms. It is necessary to redesign new algorithms Computing Model Accuracy and Approximation Efficiency requirements for Algorithm Quality data for quality models in big data analytics Quality decisions must be based on quality big data! Big Data Preprocessing Noise in data distorts Need automatic methods for cleaning the data Missing values management Big Data Reduction

Final Comments Where we are going: 3BigDatastages http://www.kdnuggets.com/2013/12/3-stages-big-data.html By Gregory Piatetsky, Dec 8, 2013. Big Data 3.0: Intelligent Google Now, Watson (IBM)... Big Data 3.0 would be a combination of data, with huge knowledge bases and a very large collection of algorithms, perhaps reaching the level of true Artificial Intelligence (Singularity?).

Final Comments

Thanks!!! Big Data: Technologies Big Data: Technologies and Applications