Technologies and algorithms to
- Byron Harmon
- 10 years ago
Transcription
1 Big Data: Technologies and algorithms to deal with challenges. Francisco Herrera, Research Group on Soft Computing and Information Intelligent Systems (SCI 2 S), Dept. of Computer Science and A.I., University of Granada, Spain
2 Big Data
3 Outline Big Data. Big Data Science Why Big Data? Technologies: MapReduce Paradigm. Hadoop Ecosystem Big Data Analytics. Algorithms Real Cases Big Data at SCI 2 S Final Comments. Challenges
5 What is Big Data? 3 Vs of Big Data: Volume. E.g., genomics, astronomy, transactions. (Gartner, Doug Laney, 2001)
6 What is Big Data? 3 Vs of Big Data: Velocity. (Gartner, Doug Laney, 2001)
7 What is Big Data? 3 Vs of Big Data: Variety. (Gartner, Doug Laney, 2001)
8 What is Big Data? 3 Vs of Big Data. Some make it 4 Vs: Veracity.
9 What is Big Data?
10 What is Big Data? No single standard definition. Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. Big Data is data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it.
11 What is Big Data? (in short) Big data refers to any characteristic of a problem that makes it a challenge to process with traditional applications.
12 (Big) Data Science. Data Science combines the traditional scientific method with the ability to explore, learn and gain deep insight from (Big) Data. It is not just about finding patterns in data; it is mainly about explaining those patterns.
13 Data Science Process. Data Preprocessing (> 70% of the time!): clean, sample, aggregate, handle imperfect data (missing values, noise, ...), reduce dimensionality. Data Processing: explore data, represent data, link data, learn from data, deliver insight. Data Analytics: clustering, classification, regression, network analysis, visual analytics, association.
15 Why Big Data? Scalability to large data volumes: Scan 100 TB on 1 node @ 50 MB/sec = 23 days. Scan on 1000-node cluster = 33 minutes. Divide-And-Conquer (i.e., data partitioning). A single machine cannot manage large volumes of data efficiently.
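The scan times on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a perfectly even split of the data across nodes with no coordination overhead; the function name is illustrative only.

```python
# Back-of-the-envelope scan times, assuming decimal units and ideal scaling.
TB = 10**12
MB = 10**6

def scan_seconds(data_bytes, nodes, mb_per_sec_per_node):
    """Time to scan the data when it is split evenly across the nodes."""
    throughput = nodes * mb_per_sec_per_node * MB  # bytes per second
    return data_bytes / throughput

single = scan_seconds(100 * TB, nodes=1, mb_per_sec_per_node=50)
cluster = scan_seconds(100 * TB, nodes=1000, mb_per_sec_per_node=50)

print(f"1 node:     {single / 86400:.1f} days")
print(f"1000 nodes: {cluster / 60:.1f} minutes")
```

Under these assumptions the single node needs 2,000,000 seconds (about 23 days) and the 1000-node cluster needs 2,000 seconds (about 33 minutes), matching the figures quoted on the slide.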
16 Why Big Data? Scalability to large data volumes: Scan 100 TB on 1 node @ 50 MB/sec = 23 days. Scan on 1000-node cluster = 33 minutes. Divide-And-Conquer (i.e., data partitioning). How must we process 1000 TB or TB?
17 Why Big Data? MapReduce. Scalability to large data volumes: Scan 100 TB on 1 node @ 50 MB/sec = 23 days. Scan on 1000-node cluster = 33 minutes. Divide-And-Conquer (i.e., data partitioning). MapReduce Overview: Data-parallel programming model. An associated parallel and distributed implementation for commodity clusters. Pioneered by Google: processes 20 PB of data per day. Popularized by the open-source Hadoop project: used by Yahoo!, Facebook, Amazon, and the list is growing.
18 MapReduce. MapReduce is a popular approach to deal with Big Data: data fragmentation with parallel processing + model fusion.
19 MapReduce. MapReduce workflow: blocks/fragments, intermediary files, output files. The key of a MapReduce data fragmentation approach is usually in the reduce phase. J. Dean, S. Ghemawat, MapReduce: Simplified data processing on large clusters, Communications of the ACM 51 (1) (2008)
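The workflow above (fragments, intermediary files, output files) can be illustrated with a minimal single-process sketch of the classic word-count example. This is not Hadoop's API: the function names `map_phase`, `shuffle`, `reduce_phase` and `map_reduce` are hypothetical, and a real framework would run the map calls on different nodes and write the intermediate pairs to files.

```python
from collections import defaultdict

def map_phase(fragment):
    """Map: emit a (word, 1) pair for every word in one data fragment."""
    return [(word.lower(), 1) for word in fragment.split()]

def shuffle(pairs):
    """Shuffle: group the intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: fuse the partial results collected for one key."""
    return key, sum(values)

def map_reduce(fragments):
    intermediate = []
    for fragment in fragments:  # each fragment could be mapped on a different node
        intermediate.extend(map_phase(fragment))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

fragments = ["big data big clusters", "data parallel processing", "big models"]
counts = map_reduce(fragments)
print(counts)
```

For word counting the reduce step is a plain sum; for machine learning models the fusion in the reduce step is usually where the algorithm-specific design effort goes, as the slide notes.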
20 MapReduce Experience Runs on large commodity clusters: 10s to 10,000s of machines Processes many terabytes of data Easy to use since run-time complexity hidden from the users Cost-efficiency: Commodity nodes (cheap, but unreliable) Commodity network Automatic fault-tolerance (fewer administrators) Easy to use (fewer programmers)
21 MapReduce. Hadoop. Hadoop is an open source implementation of the MapReduce computational paradigm. Created by Doug Cutting (chairman of the board of directors of the Apache Software Foundation, 2010)
22 MapReduce. Hadoop birth. July: Hadoop Wins Terabyte Sort Benchmark. One of Yahoo's Hadoop clusters sorted 1 terabyte of data in 209 seconds, which beat the previous record of 297 seconds in the annual general purpose (Daytona) terabyte sort benchmark. This is the first time that either a Java or an open source program has won.
23 MapReduce: Limitations. If all you have is a hammer, then everything looks like a nail. The following types of algorithms are examples where MapReduce does not fit well: iterative graph algorithms such as PageRank (see Pregel, Google), gradient descent, expectation maximization.
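To see why iterative algorithms strain the model, consider PageRank written as repeated map and reduce passes. The sketch below is a toy single-process version on a hypothetical three-page graph; in Hadoop MapReduce each pass would typically be a separate job that re-reads its input from the distributed file system, so the per-pass overhead is paid many times.

```python
def pagerank_pass(graph, ranks, damping=0.85):
    """One PageRank iteration written as a map step plus a reduce step."""
    n = len(graph)
    # Map: every page sends an equal share of its rank to each outgoing link.
    contributions = []
    for page, links in graph.items():
        for target in links:
            contributions.append((target, ranks[page] / len(links)))
    # Reduce: sum the shares received by each page.
    received = {page: 0.0 for page in graph}
    for target, share in contributions:
        received[target] += share
    return {page: (1 - damping) / n + damping * received[page] for page in graph}

# Toy graph (hypothetical data); every page has at least one outgoing link.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {page: 1 / len(graph) for page in graph}

passes = 0
while True:
    new_ranks = pagerank_pass(graph, ranks)
    passes += 1
    change = sum(abs(new_ranks[p] - ranks[p]) for p in graph)
    ranks = new_ranks
    if change < 1e-6 or passes >= 100:  # stop on convergence or after 100 passes
        break

print(f"stopped after {passes} passes:", ranks)
```

Even this tiny graph needs several passes before the ranks settle, which is the pattern that in-memory and graph-oriented platforms are designed to handle more efficiently.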
24 MapReduce: Limitations More than applications in Google Enrique Alfonseca Google Research Zurich
25 Hadoop. On the limitations of Hadoop: new platforms. GIRAPH (Apache project): iterative graph processing. Twister (Indiana University): own clusters. GPS - A Graph Processing System (Stanford): for Amazon's EC2. Distributed GraphLab (Carnegie Mellon Univ.): Amazon's EC2. PrIter (University of Massachusetts Amherst, Northeastern University-China): own clusters and Amazon EC2 cloud. HaLoop (University of Washington): Amazon's EC2. Spark (UC Berkeley): 100 times more efficient than Hadoop, including iterative algorithms, according to creators. GPU-based platforms: Mars, Grex.
26 Hadoop Ecosystem. The project GIRAPH (Apache project): iterative graphs. Recently: Apache Spark. Spark (UC Berkeley): 100 times more efficient than Hadoop, including iterative algorithms, according to creators.
27 Spark Hadoop: In Memory. Hadoop evolution: in-memory HDFS. Hadoop + Spark ecosystem. Apache Spark. Future version of Mahout for Spark. Databricks, Groupon, eBay Inc., Amazon, Hitachi, Nokia, Yahoo!, ...
28 Spark Hadoop. Spark birth. October 10, 2014: Using Spark on 206 EC2 nodes, we completed the benchmark in 23 minutes. This means that Spark sorted the same data 3X faster using 10X fewer machines. All the sorting took place on disk (HDFS), without using Spark's in-memory cache.
29 Spark Hadoop. Spark birth
30 Why Big Data? MapReduce Paradigm. Hadoop Ecosystem (A Snapshot). Big Data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making. There are two main issues related to Big Data: 1. Database/storage frameworks: to write, read, and manage data. 2. Computational models: to process and analyze data. Current Big Data frameworks include: 1. Storage frameworks: Google File System (GFS), Hadoop Distributed File System (HDFS). 2. Computational models: MapReduce (Apache Hadoop), Resilient Distributed Datasets (RDD by Apache Spark).
32 Big Data Analytics Potential scenarios: Clustering Classification Real Time Analytics/ Big Data Streams Association Recommendation Systems Social Media Mining Social Big Data
33 Big Data Analytics: Tools
Generation | 1st Generation | 2nd Generation | 3rd Generation
Examples | SAS, R, Weka, SPSS, KEEL | Mahout, Pentaho, Cascading | Spark, HaLoop, GraphLab, Pregel, Giraph, ML over Storm
Scalability | Vertical | Horizontal (over Hadoop) | Horizontal (beyond Hadoop)
Algorithms available | Huge collection of algorithms | Small subset: sequential logistic regression, linear SVMs, stochastic gradient descent, k-means clustering, random forest, etc. | Much wider: CGD, ALS, collaborative filtering, kernel SVM, matrix factorization, Gibbs sampling, etc.
Algorithms not available | Practically nothing | Vast no.: kernel SVMs, multivariate logistic regression, conjugate gradient descent, ALS, etc. | Multivariate logistic regression in general form, k-means clustering, etc. Work in progress to expand the set of available algorithms
Fault tolerance | Single point of failure | Most tools are FT, as they are built on top of Hadoop | FT: HaLoop, Spark. Not FT: Pregel, GraphLab, Giraph
34 Big Data Analytics: Tools Mahout MLlib Version 1.4.1
35 Big Data Analytics: Tools MLlib
36 Big Data Analytics. Case Study: Random Forest for KddCup 99. The RF Mahout Partial implementation is an algorithm that builds multiple trees for different portions of the data. Two phases: building phase, classification phase.
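The two phases can be sketched in plain Python. This is not Mahout's code or API: as a simplifying assumption, each "tree" is a one-level decision stump trained on one portion of the data, all function names are hypothetical, and the data is a toy set invented for illustration.

```python
from collections import Counter

def train_stump(rows):
    """Stand-in for a tree learner: the best single-feature threshold split."""
    best = None
    for f in range(len(rows[0][0])):
        for x, _ in rows:
            t = x[f]
            left = [y for xs, y in rows if xs[f] <= t]
            right = [y for xs, y in rows if xs[f] > t]
            if not left or not right:
                continue
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:  # no valid split: always predict the majority class
        label = Counter(y for _, y in rows).most_common(1)[0][0]
        return (0, float("inf"), label, label)
    return best[1:]

def stump_predict(stump, x):
    f, t, l_lab, r_lab = stump
    return l_lab if x[f] <= t else r_lab

def build_phase(partitions):
    """Building phase: one model per data portion (one mapper each)."""
    return [train_stump(part) for part in partitions]

def classify_phase(forest, x):
    """Classification phase: majority vote over all partial models."""
    votes = Counter(stump_predict(s, x) for s in forest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: label is "attack" when the single feature is > 5.
data = [((v,), "attack" if v > 5 else "normal") for v in range(12)]
partitions = [data[0::3], data[1::3], data[2::3]]
forest = build_phase(partitions)
print(classify_phase(forest, (1,)), classify_phase(forest, (10,)))
```

Each partial model sees only its own portion, so the building phase parallelizes trivially; the price is that no single tree has seen the whole training set, and the vote in the classification phase is what combines them.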
37 Big Data Analytics. Case Study: Random Forest for KddCup 99. Class / Instance Number: normal, DOS, PRB, R2L, U2R 52
38 Big Data Analytics. Case Study: Random Forest for KddCup 99. Class / Instance Number: normal, DOS, PRB, R2L, U2R 52. Cluster ATLAS: 16 nodes. Microprocessors: 2 x Intel E (6 cores/12 threads, 2 GHz). RAM: 64 GB DDR3 ECC 1600MHz. Mahout version 0.8
39 Big Data Analytics: Two last comments. Without Analytics, Big Data is Just Noise. Guest post by Eric Schwartzman, founder and CEO of Comply Socially
40 Big Data Analytics:Two last comments Mahout Lack of big data preprocessing algorithms MLlib Version 1.4.1
41 Big Data Analytics Data Preprocessing Big Data Data Science Model building Predictive and descriptive Analytics Lack of big data preprocessing algorithms
42 Outline Big Data. Big Data Science Why Big Data? Technologies: MapReduce Paradigm. Hadoop Ecosystem Big Data Analytics. Algorithms Real Cases. Health, Social Media and People shopping mall identification Big Data at SCI 2 S Final Comments. Challenges
43 Real Cases. Health. Healthcare organizations have collected exabytes of data in recent years. Big data can facilitate action on the modifiable risk factors that contribute to a large fraction of the chronic disease burden, such as physical activity, diet, tobacco use, and exposure to pollution. Big Data, Vol.1, 3, 2013
44 Real Cases. Health and Social Media. Detect pandemic risk in real time. Google Flu Trends (2008) is a web service operated by Google. Like the disease control and prevention centers, it could tell where the flu had spread, but almost in real time rather than with a delay of one or two weeks. It provides estimates of influenza activity for more than 25 countries by aggregating Google search queries (45 different terms). J. Ginsberg, M.H. Mohebbi, R.S. Patel, L. Brammer, M.S. Smolinski, L. Brilliant. Detecting influenza epidemics using search engine query data. Nature 457 (2009)
45 Real Cases. Health and Social Media. Detect pandemic risk in real time. Google Flu. J. Ginsberg, M.H. Mohebbi, R.S. Patel, L. Brammer, M.S. Smolinski, L. Brilliant. Detecting influenza epidemics using search engine query data. Nature 457 (2009)
46 Real Cases. Health and Social Media. Detect pandemic risk in real time. Google Flu. In 2013 it overestimated the levels of influenza (twice the estimate of the disease control and prevention centers). The overestimation may be because extensive media coverage of the flu can modify search behavior. Science, March 2014
47 Real Cases. Health and Social Media. Detect pandemic risk in real time. Google Flu. The models are updated annually. The 2014 model update was applied to prior years. The model has been extended to Dengue.
48 Real Cases. Health and Social Media. Detect pandemic risk in real time. Wikipedia is better than Google at tracking flu trends? Wikipedia traffic could be used to provide real time tracking of flu cases, according to the study published by John Brownstein. Wikipedia Usage Estimates Prevalence of Influenza-Like Illness in the United States in Near Real-Time. David J. McIver, John S. Brownstein, Harvard Medical School, Boston Children's Hospital, PLoS Computational Biology, 2014.
49 Real Cases. Health and Social Media. You Are What You Tweet: Analyzing Twitter for Public Health. Discovering Health Topics in Social Media Using Topic Models. Michael J. Paul, Mark Dredze, Johns Hopkins University, PLoS ONE, 2014. Analyzing user messages in social media can measure different population characteristics, including public health measures. A system filtering Twitter data can automatically infer health topics in 144 million Twitter messages from 2011 onward. ATAM discovered 13 coherent clusters of Twitter messages, some of which correlate with seasonal influenza (r = 0.689) and allergies (r = 0.810) temporal surveillance data, as well as exercise (r = 0.534) and obesity (r = -0.631) related geographic survey data in the United States.
50 Real Cases. Shopping mall identification. 3 MONTHS OF CREDIT CARD RECORDS 11 MILLIONS OF PEOPLE SCIENCE VOL 347 January 30, 2015, pp
52 Big Data at SCI 2 S - UGR Bird's eye view SCI 2 S website
53 Big Data at SCI 2 S - UGR Our approaches: Big Data with Fuzzy Models Fuzzy Rule Based System for classification Fuzzy Rule Based System with cost sensitive for imbalanced data sets
54 Big Data at SCI 2 S - UGR Our approaches: Imbalanced Big Data Preprocessing: Random undersampling, oversampling Cost sensitive Bioinformatic Applications
55 Big Data at SCI 2 S - UGR Our approaches: Big Data Preprocessing Evolutionary data reduction Feature selection and discretization
56 ECBDL 14 Big Data Competition, Vancouver, 2014. ECBDL 14 Big Data Competition 2014: Self-deployment track. Objective: contact map prediction. Details: 32 million instances; 631 attributes (539 real & 92 nominal values); 2 classes; 98% of negative examples; about 56.7GB of disk space. Evaluation: true positive rate times true negative rate (TPR · TNR). J. Bacardit et al, Contact map prediction using a large-scale ensemble of rule sets and the fusion of multiple predicted structural features, Bioinformatics 28 (19) (2012)
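A short sketch shows why the product TPR · TNR is a sensible score when 98% of the examples are negative. The helper name `tpr_tnr_product` and the toy labels are assumptions for illustration, and the function assumes both classes are present in `y_true`.

```python
def tpr_tnr_product(y_true, y_pred, positive=1):
    """Return (TPR, TNR, TPR * TNR); assumes both classes occur in y_true."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    tpr = tp / (tp + fn)  # fraction of positives that were found
    tnr = tn / (tn + fp)  # fraction of negatives that were kept negative
    return tpr, tnr, tpr * tnr

# Toy labels with the same 98% / 2% class split as the competition data.
y_true = [0] * 98 + [1] * 2
always_negative = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
tpr, tnr, score = tpr_tnr_product(y_true, always_negative)
print(f"accuracy={accuracy:.2f}  TPR={tpr:.2f}  TNR={tnr:.2f}  TPR*TNR={score:.2f}")
```

A classifier that always answers "negative" reaches 98% accuracy but scores 0 on TPR · TNR, so the product rewards only models that do well on both classes.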
57 Evolutionary Computation for Big Data and Big Learning Workshop. ECBDL 14 Big Data Competition 2014: Self-deployment track. The challenge: Very large training set; it does not fit all together in memory. Even the test set is large (5.1GB, 2.9 million instances). Relatively high dimensional data. Low ratio (<2%) of true contacts. Imbalance rate: >49. Imbalanced problem! Imbalanced Big Data Classification
58 Imbalanced Big Data Classification. A MapReduce Approach. 32 million instances, 98% of negative examples. Low ratio of true contacts (<2%). Imbalance rate: > 49. Imbalanced problem! Previous study on extremely imbalanced big data: S. Río, V. López, J.M. Benítez, F. Herrera, On the use of MapReduce for Imbalanced Big Data using Random Forest. Information Sciences 285 (2014). Over-Sampling (random, focused). Under-Sampling (random, focused). Cost Modifying (cost-sensitive). Boosting/Bagging approaches (with preprocessing)
59 ECBDL 14 Big Data Competition. Our approach: 1. Balance the original training data: Random Oversampling (as a first idea; it was extended). 2. Learning a model: Random Forest.
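The balancing step can be sketched as follows. Random oversampling (ROS) replicates randomly chosen minority instances; the sketch assumes that an oversampling rate of 100% means the minority class is grown to the size of the majority class, and the function name, `rate` parameter and toy data are hypothetical rather than taken from the competition code.

```python
import random

def random_oversample(rows, minority_label, rate=1.0, seed=0):
    """Replicate random minority rows until they number rate * len(majority)."""
    rng = random.Random(seed)
    minority = [r for r in rows if r[1] == minority_label]
    majority = [r for r in rows if r[1] != minority_label]
    target = int(round(rate * len(majority)))
    extra = [rng.choice(minority) for _ in range(max(0, target - len(minority)))]
    return majority + minority + extra

# Toy data with 98% negatives (label 0) and 2% positives (label 1).
rows = [((i,), 0) for i in range(98)] + [((i,), 1) for i in range(2)]
imbalance_ratio = 98 / 2  # majority size divided by minority size

balanced = random_oversample(rows, minority_label=1, rate=1.0)
positives = sum(1 for _, label in balanced if label == 1)
print(f"imbalance ratio before: {imbalance_ratio:.0f}")
print(f"after ROS: {positives} positives out of {len(balanced)} rows")
```

With 98% negative examples the imbalance ratio is 49, in line with the figure quoted for the competition data; a rate above 1.0 would make the replicated minority class larger than the majority class.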
60 ECBDL 14 Big Data Competition Oversampling rate: {100%} RandomForest
61 ECBDL 14 Big Data Competition
62 ECBDL 14 Big Data Competition. Our approach: 1. Balance the original training data: Random Oversampling, increasing the ROS percentage (as a first idea; it was extended). 2. Learning a model: Random Forest. 3. Detect relevant features: Evolutionary Feature Weighting. Classifying test set.
63 ECBDL 14 Big Data Competition. How to increase the performance? Third component: MapReduce approach for Feature Weighting, to obtain higher performance across classes
64 ECBDL 14 Big Data Competition. Last decision: we investigated increasing ROS up to 180% with 64 mappers. Results table (64 mappers). Columns: Algorithms, TNR*TPR Training, TPR, TNR, TNR*TPR Test. Rows: ROS+RF (130% + FW 90+25f+200t), ROS+RF (140% + FW 90+25f+200t), ROS+RF (150% + FW 90+25f+200t), ROS+RF (160% + FW 90+25f+200t), ROS+RF (170% + FW 90+25f+200t), ROS+RF (180% + FW 90+25f+200t). Increasing ROS and reducing the number of mappers led us to a trade-off with good results. ROS: replications of the minority instances.
65 ECBDL 14 Big Data Competition. Evolutionary Computation for Big Data and Big Learning Workshop. Results of the competition: contact map prediction. Columns: Team Name, TPR, TNR, Acc, TPR · TNR. Teams: Efdamis, ICOS, UNSW, HyperEns, PUC-Rio_ICA, Test, EmeraldLogic, LidiaGroup. The EFDAMIS team ranked first in the ECBDL 14 big data competition.
66 ECBDL 14 Big Data Competition. Our algorithm: ROSEFW-RF. MapReduce process; iterative MapReduce process. I. Triguero, S. del Río, V. López, J. Bacardit, J.M. Benítez, F. Herrera. ROSEFW-RF: The winner algorithm for the ECBDL'14 Big Data Competition: An extremely imbalanced big data bioinformatics problem. Knowledge-Based Systems, Volume 87, October 2015, Pages
68 Final Comments. Parallelization of machine learning algorithms with data-partitioning approaches can be successfully performed with MapReduce: partition the data and apply the machine learning algorithm to each part. Focus on the combination phase (reduce): the combination of models is the challenge for each algorithm. Data Mining, Machine learning and data preprocessing: a huge collection of algorithms, compared with the few available big data analytics algorithms.
69 Final Comments. Data Mining, Machine learning and data preprocessing: huge collection of algorithms. Big Data Analytics. Big Data: a small subset of algorithms. Big Data Preprocessing: a few methods for preprocessing in Big Data analytics. Deep learning: neural networks based approach for signal/image processing in big data.
70 Final Comments. Big data and analytics: a large challenge offering great opportunities. A small subset of algorithms: it is necessary to redesign algorithms. Computing model. Accuracy and approximation. Efficiency requirements for algorithms. Quality data for quality models in big data analytics: quality decisions must be based on quality big data! Big Data Preprocessing: noise in data distorts; need automatic methods for cleaning the data; missing values management; Big Data reduction.
71 Final Comments. Where we are going: 3 Big Data stages. By Gregory Piatetsky, Dec 8, Big Data 3.0: Intelligent. Google Now, Watson (IBM)... Big Data 3.0 would be a combination of data, with huge knowledge bases and a very large collection of algorithms, perhaps reaching the level of true Artificial Intelligence (Singularity?).
72 Final Comments
73 Thanks!!! Big Data: Technologies Big Data: Technologies and Applications
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, [email protected] Assistant Professor, Information
Architectures for Big Data Analytics A database perspective
Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum
Software tools for Complex Networks Analysis. Fabrice Huet, University of Nice Sophia-Antipolis SCALE (ex-oasis) Team
Software tools for Complex Networks Analysis Fabrice Huet, University of Nice Sophia-Antipolis SCALE (ex-oasis) Team MOTIVATION Why do we need tools? Source : nature.com Visualization Properties extraction
Challenges for Data Driven Systems
Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2
Fast Analytics on Big Data with H2O
Fast Analytics on Big Data with H2O 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,
Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics
Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics Dr. Liangxiu Han Future Networks and Distributed Systems Group (FUNDS) School of Computing, Mathematics and Digital Technology,
Bringing Big Data Modelling into the Hands of Domain Experts
Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks [email protected] 2015 The MathWorks, Inc. 1 Data is the sword of the
Hadoop implementation of MapReduce computational model. Ján Vaňo
Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google's distributed
Big Data Challenges in Bioinformatics
Big Data Challenges in Bioinformatics BARCELONA SUPERCOMPUTING CENTER COMPUTER SCIENCE DEPARTMENT Autonomic Systems and ebusiness Platforms Jordi Torres [email protected] Talk outline! We talk about Petabyte?
Spark. Fast, Interactive, Language-Integrated Cluster Computing
Spark Fast, Interactive, Language-Integrated Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica UC
International Journal of Engineering Research & Management Technology ISSN: 2348-4039 November-2015 Volume 2, Issue-6
International Journal of Engineering Research & Management Technology ISSN: 2348-4039 Email: [email protected] November-2015 Volume 2, Issue-6 www.ijermt.org Modeling Big Data Characteristics for Discovering
Moving From Hadoop to Spark
Moving From Hadoop to Spark Sujee Maniyam Founder / Principal @ www.elephantscale.com [email protected] Bay Area ACM meetup (2015-02-23) HI, Featured in Hadoop Weekly #109 About Me : Sujee
Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia
Monitis Project Proposals for AUA September 2014, Yerevan, Armenia Distributed Log Collecting and Analysing Platform Project Specifications Category: Big Data and NoSQL Software Requirements: Apache Hadoop
Is a Data Scientist the New Quant? Stuart Kozola MathWorks
Is a Data Scientist the New Quant? Stuart Kozola MathWorks 2015 The MathWorks, Inc. 1 Facts or information used usually to calculate, analyze, or plan something Information that is produced or stored by
How Companies are Using Spark
How Companies are Using Spark And where the Edge in Big Data will be Matei Zaharia History Decreasing storage costs have led to an explosion of big data Commodity cluster software, like Hadoop, has made
E6895 Advanced Big Data Analytics Lecture 3: Spark and Data Analytics
E6895 Advanced Big Data Analytics Lecture 3: Spark and Data Analytics Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science Mgr., Dept. of Network Science and Big
Dell In-Memory Appliance for Cloudera Enterprise
Dell In-Memory Appliance for Cloudera Enterprise Hadoop Overview, Customer Evolution and Dell In-Memory Product Details Author: Armando Acosta Hadoop Product Manager/Subject Matter Expert [email protected]/
Advanced Big Data Analytics with R and Hadoop
REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional
Convex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics
Convex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics Sabeur Aridhi Aalto University, Finland Sabeur Aridhi Frameworks for Big Data Analytics 1 / 59 Introduction Contents 1 Introduction
Trends and Research Opportunities in Spatial Big Data Analytics and Cloud Computing NCSU GeoSpatial Forum
Trends and Research Opportunities in Spatial Big Data Analytics and Cloud Computing NCSU GeoSpatial Forum Siva Ravada Senior Director of Development Oracle Spatial and MapViewer 2 Evolving Technology Platforms
Bayesian networks - Time-series models - Apache Spark & Scala
Bayesian networks - Time-series models - Apache Spark & Scala Dr John Sandiford, CTO Bayes Server Data Science London Meetup - November 2014 1 Contents Introduction Bayesian networks Latent variables Anomaly
Getting to Know Big Data
Getting to Know Big Data Dr. Putchong Uthayopas Department of Computer Engineering, Faculty of Engineering, Kasetsart University Email: [email protected] Information Tsunami Rapid expansion of Smartphone
Hadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
Are You Ready for Big Data?
Are You Ready for Big Data? Jim Gallo National Director, Business Analytics February 11, 2013 Agenda What is Big Data? How do you leverage Big Data in your company? How do you prepare for a Big Data initiative?
What is Big Data? Concepts, Ideas and Principles. Hitesh Dharamdasani
What is Big Data? Concepts, Ideas and Principles Hitesh Dharamdasani # whoami Security Researcher, Malware Reversing Engineer, Developer GIT > George Mason > UC Berkeley > FireEye > On Stage Building Data-driven
Spark: Making Big Data Interactive & Real-Time
Spark: Making Big Data Interactive & Real-Time Matei Zaharia UC Berkeley / MIT www.spark-project.org What is Spark? Fast and expressive cluster computing system compatible with Apache Hadoop Improves efficiency
NoSQL and Hadoop Technologies On Oracle Cloud
NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath
Chapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
Big Data and Apache Hadoop's MapReduce
Big Data and Apache Hadoop's MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
Big Data at Spotify. Anders Arpteg, Ph D Analytics Machine Learning, Spotify
Big Data at Spotify Anders Arpteg, Ph D Analytics Machine Learning, Spotify Quickly about me Quickly about Spotify What is all the data used for? Quickly about Spark Hadoop MR vs Spark Need for (distributed)
Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment
Analysing Large Files in a Hadoop Distributed Cluster Environment S Saravanan, B Uma Maheswari Department of Computer Science and Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham,
BIG DATA & ANALYTICS. Transforming the business and driving revenue through big data and analytics
BIG DATA & ANALYTICS Transforming the business and driving revenue through big data and analytics Collection, storage and extraction of business value from data generated from a variety of sources are
A Performance Analysis of Distributed Indexing using Terrier
A Performance Analysis of Distributed Indexing using Terrier Amaury Couste Jakub Kozłowski William Martin Indexing Indexing Used by search
An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework
An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework Jakrarin Therdphapiyanak Dept. of Computer Engineering Chulalongkorn University
Sunnie Chung. Cleveland State University
Sunnie Chung Cleveland State University Data Scientist Big Data Processing Data Mining 2 INTERSECT of Computer Scientists and Statisticians with Knowledge of Data Mining AND Big data Processing Skills:
Open source Google-style large scale data analysis with Hadoop
Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical
From GWS to MapReduce: Google's Cloud Technology in the Early Days
Large-Scale Distributed Systems From GWS to MapReduce: Google's Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu [email protected] MapReduce/Hadoop
Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce
Analytics in the Cloud Peter Sirota, GM Elastic MapReduce Data-Driven Decision Making Data is the new raw material for any business on par with capital, people, and labor. What is Big Data? Terabytes of
Big Data Analysis: Apache Storm Perspective
Big Data Analysis: Apache Storm Perspective Muhammad Hussain Iqbal 1, Tariq Rahim Soomro 2 Faculty of Computing, SZABIST Dubai Abstract the boom in the technology has resulted in emergence of new concepts
Beyond Hadoop with Apache Spark and BDAS
Beyond Hadoop with Apache Spark and BDAS Khanderao Kand Principal Technologist, Guavus 12 April GITPRO World 2014 Palo Alto, CA Credit: Some statistics and content came from presentations from publicly shared
Big Data Systems CS 5965/6965 FALL 2015
Big Data Systems CS 5965/6965 FALL 2015 Today General course overview Expectations from this course Q&A Introduction to Big Data Assignment #1 General Course Information Course Web Page http://www.cs.utah.edu/~hari/teaching/fall2015.html
