Massive scale analytics with Stratosphere using R
|
|
|
- Rachel Webb
- 10 years ago
- Views:
Transcription
1 Massive scale analytics with Stratosphere using R Jose Luis Lopez Pino [email protected] Database Systems and Information Management Technische Universität Berlin Supervised by Volker Markl Advised by Marcus Leich, Kostas Tzoumas August 28, 2014
2 Introduction Jose Luis Lopez Pino 2
3 Data analysis to the masses Deep analytics 1 : sophisticated statistical methods like linear models, clustering or classification that frequently are used to extract knowledge from the data. Data warehousing and BI can t answer all the questions. The ever-growing number of new data sources and tools make it worse. There is demand for this questions. In small scale: data pipelining tools (RapidMiner) and numerical computing environments (R, Matlab or SPSS). Big data brings new opportunities to the market but also presents unfamiliar challenges. 1 Sudipto Das, Yannis Sismanis, Kevin S Beyer, Rainer Gemulla, Peter J Haas, and John McPherson. Ricardo: integrating r and hadoop. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, pages ACM, 2010 Jose Luis Lopez Pino [email protected] 3
4 Options R: R is a numerical computing environment and DSL for stats. Not a query language unlike SQL. Succesful for small scale (in combination with CRAN packages). MapReduce/Hadoop: Highly parallel programs but lack of expressivity. HDFS: a de-facto standard to store big amounts of data. Stratosphere: Platform for massively parallel computing / big data analytics. PACT: MapReduce + New operators + Iterations. Jose Luis Lopez Pino [email protected] 4
5 Basic terms and definitions KDD is compound of nine steps: understanding the domain and the goals, creating the target source, cleaning and processing the source, data reduction and projection, choosing a data mining method, choosing the data mining algorithm, mining the data, interpretation of the patterns. Figure: Overview of the process 2 2 Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. The kdd process for extracting useful knowledge from volumes of data. Commun. ACM, 39(11):27 34, November 1996 Jose Luis Lopez Pino [email protected] 5
6 Motivation Jose Luis Lopez Pino 6
7 Clustering Jose Luis Lopez Pino 7
8 Classification Jose Luis Lopez Pino 8
9 Frequent Terms Jose Luis Lopez Pino 9
10 Writing massively parallel programs It is a cumbersome and onerous process. We need of single tools. We need tools that can process from a small amount of data up to very large volumes. The majority of data researchers are strongly skilled in R and statistics and poorly skills in Big Data systems and implementation of machine learning algorithm. 3 4 Although Stratosphere offers a more expressive interface, writing a parallel program is still not a trivial job. 3 Harlan Harris, Sean Murphy, and Marck Vaisman. Analyzing the Analyzers: An Introspective Survey of Data Scientists and Their Work. O Reilly Media, Inc., Sean Kandel, Andreas Paepcke, Joseph M Hellerstein, and Jeffrey Heer. Enterprise data analysis and visualization: An interview study. Visualization and Computer Graphics, IEEE Transactions on, 18(12): , 2012 Jose Luis Lopez Pino [email protected] 10
11 Relation with the KDD process Data extraction is covered by other solutions. Pre-processing and transformation seem difficult. Data mining: where we have a competitive advantage. Data visualization is a different problem. Jose Luis Lopez Pino [email protected] 11
12 Design goals Easiness: ready-to-use algorithms. Design a library. Facilitate working with data. Easy to distribute. Focus on algorithms that scale. Jose Luis Lopez Pino [email protected] 12
13 Our approach Jose Luis Lopez Pino 13
14 Architecture Jose Luis Lopez Pino 14
15 Architecture Jose Luis Lopez Pino 15
16 Library: Goals Classification, clustering and regression. No Free Lunch Theorem: more than one algorithm. Presence in other ML libraries. Large-scale. Ensemble scenarios. Jose Luis Lopez Pino 16
17 Library: Example Jose Luis Lopez Pino 17
18 R package Easy to distribute. Organized in namespaces. Submitting jobs to the cluster. Working with files. Mining. Configuration. Jose Luis Lopez Pino 18
19 Introduction Motivation Our approach Related work Conclusions and Future Work Example: Code Jose Luis Lopez Pino 19
20 Example: Non-parallel classification example
21 Example: Parallel classification example
22 Example: Parallel clustering example
23 Performance Competitive and even faster than native R programs thanks to the pipelining for every parallelizable programs in the same (small) file size range. Competitive with R for data mining tasks with a lot of iterations in the same file size range. Able to process files of a volume that is inaccessible for R. Able to scale to gigabyte level without significant loss. Jose Luis Lopez Pino [email protected] 23
24 Performance: Frequent Terms example Jose Luis Lopez Pino 24
25 Performance: Most favorable case to R Figure: KMeans 100 iterations
26 Performance: Breakdown example Figure: Clustering example nonparallel breakdown (Time in seconds) Jose Luis Lopez Pino 26
27 Performance: Scalability example Figure: Frequent Terms parallel scalability Jose Luis Lopez Pino 27
28 Related work Jose Luis Lopez Pino 28
29 Data mining libraries Don t scale: Weka and sci-kit. Large-scale:. Mahout: limited set of problems. MLlib: also facilitates implementation of new algorithms. Oryx. In-database: MADlib and PivotalR. Jose Luis Lopez Pino [email protected] 29
30 Data intensive computation with R External memory. Don t scale-out: biglm, bigmemory, ff, foreach. RevoScaleR: xdf files and Hadoop. Divide and recombine: it s necessary to use the MR model. Query languages: Limited expressivity. Good for the first step of the KDD process. Distributed collection manipulation: Limited set of operators. Presto and SparkR. Jose Luis Lopez Pino [email protected] 30
31 Conclusions and Future Work Jose Luis Lopez Pino 31
32 Conclusion Contributions:. Library definition. File manipulation and cluster interaction. Scenarios that proof the concept. Code very similar to the original one. Promising performance evaluation. Jose Luis Lopez Pino 32
33 Future work Improvements in the library. Hybrid approaches. Distributed evaluation. Improvements in the architecture. Jose Luis Lopez Pino 33
34 Essential bibliography Sudipto Das, Yannis Sismanis, Kevin S Beyer, Rainer Gemulla, Peter J Haas, and John McPherson. Ricardo: integrating r and hadoop. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, pages ACM, 2010 Hadley Wickham. Advanced R Programming. CRC Press, To appear Alexander Alexandrov, Rico Bergmann, Stephan Ewen, Johann-Christoph Freytag, Fabian Hueske, Arvid Heise, Odej Kao, Marcus Leich, Ulf Leser, Volker Markl, Felix Naumann, Mathias Peters, Astrid Rheinlnder, MatthiasJ. Sax, Sebastian Schelter, Mareike Hger, Kostas Tzoumas, and Daniel Warneke. The stratosphere platform for big data analytics. The VLDB Journal, pages 1 26, 2014 Hai Qian. Pivotalr: A package for machine learning on big data. The R Journal, 6(1):57 67, June 2014 Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. The kdd process for extracting useful knowledge from volumes of data. Commun. ACM, 39(11):27 34, November 1996
35 Recap 1 Introduction Data analysis to the masses Options Basic terms and definitions 2 Motivation Motivating problems Writing massively parallel programs Relation with the KDD process Design goals 3 Our approach Architecture Library R package Example Performance 4 Related work Data mining libraries Data intensive computation with R 5 Conclusions and Future Work Conclusion Future work Essential bibliography
Introduction to Big Data Analytics p. 1 Big Data Overview p. 2 Data Structures p. 5 Analyst Perspective on Data Repositories p.
Introduction p. xvii Introduction to Big Data Analytics p. 1 Big Data Overview p. 2 Data Structures p. 5 Analyst Perspective on Data Repositories p. 9 State of the Practice in Analytics p. 11 BI Versus
Workflow Management System for Stratosphere
Workflow Management System for Stratosphere 1 THESIS PRESENTATION BY SURYAMITA HARINDRARI SEPTEMBER 5 TH, 2014 THESIS ADVISOR: ASTERIOS KATSIFODIMOS, PHD THESIS SUPERVISOR: PROF. DR. VOLKER MARKL DATABASE
Memory Management in BigData
Memory Management in BigData Yanish Pradhananga 1, Shridevi C. Karande 2 1 Student, Maharashtra Institute of Technology, Pune 411038, India 2 Assistant Professor, Maharashtra Institute of Technology, Pune
SEIZE THE DATA. 2015 SEIZE THE DATA. 2015
1 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. BIG DATA CONFERENCE 2015 Boston August 10-13 Predicting and reducing deforestation
Collaborative Filtering Scalable Data Analysis Algorithms Claudia Lehmann, Andrina Mascher
Collaborative Filtering Scalable Data Analysis Algorithms Claudia Lehmann, Andrina Mascher Outline 2 1. Retrospection 2. Stratosphere Plans 3. Comparison with Hadoop 4. Evaluation 5. Outlook Retrospection
Ontology construction on a cloud computing platform
Ontology construction on a cloud computing platform Exposé for a Bachelor's thesis in Computer science - Knowledge management in bioinformatics Tobias Heintz 1 Motivation 1.1 Introduction PhenomicDB is
EMC Greenplum Driving the Future of Data Warehousing and Analytics. Tools and Technologies for Big Data
EMC Greenplum Driving the Future of Data Warehousing and Analytics Tools and Technologies for Big Data Steven Hillion V.P. Analytics EMC Data Computing Division 1 Big Data Size: The Volume Of Data Continues
Architectures for Big Data Analytics A database perspective
Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum
PivotalR: A Package for Machine Learning on Big Data
PivotalR: A Package for Machine Learning on Big Data Hai Qian Predictive Analytics Team, Pivotal Inc. [email protected] Copyright 2013 Pivotal. All rights reserved. What Can Small Data Scientists Bring
Big Data Research in Berlin BBDC and Apache Flink
Big Data Research in Berlin BBDC and Apache Flink Tilmann Rabl [email protected] dima.tu-berlin.de bbdc.berlin 1 2013 Berlin Big Data Center All Rights Reserved DIMA 2015 Agenda About Data Management,
Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing
Introduction to Data Mining and Machine Learning Techniques Iza Moise, Evangelos Pournaras, Dirk Helbing Iza Moise, Evangelos Pournaras, Dirk Helbing 1 Overview Main principles of data mining Definition
Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015
Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib
Index Contents Page No. Introduction . Data Mining & Knowledge Discovery
Index Contents Page No. 1. Introduction 1 1.1 Related Research 2 1.2 Objective of Research Work 3 1.3 Why Data Mining is Important 3 1.4 Research Methodology 4 1.5 Research Hypothesis 4 1.6 Scope 5 2.
Advanced In-Database Analytics
Advanced In-Database Analytics Tallinn, Sept. 25th, 2012 Mikko-Pekka Bertling, BDM Greenplum EMEA 1 That sounds complicated? 2 Who can tell me how best to solve this 3 What are the main mathematical functions??
Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.
Collaborative Big Data Analytics 1 Big Data Is Less About Size, And More About Freedom TechCrunch!!!!!!!!! Total data: bigger than big data 451 Group Findings: Big Data Is More Extreme Than Volume Gartner!!!!!!!!!!!!!!!
Scalable Machine Learning - or what to do with all that Big Data infrastructure
- or what to do with all that Big Data infrastructure TU Berlin blog.mikiobraun.de Strata+Hadoop World London, 2015 1 Complex Data Analysis at Scale Click-through prediction Personalized Spam Detection
Big Data Research in the AMPLab: BDAS and Beyond
Big Data Research in the AMPLab: BDAS and Beyond Michael Franklin UC Berkeley 1 st Spark Summit December 2, 2013 UC BERKELEY AMPLab: Collaborative Big Data Research Launched: January 2011, 6 year planned
The Stratosphere platform for big data analytics
The VLDB Journal DOI 10.1007/s00778-014-0357-y REGULAR PAPER The Stratosphere platform for big data analytics Alexander Alexandrov Rico Bergmann Stephan Ewen Johann-Christoph Freytag Fabian Hueske Arvid
Data Mining Solutions for the Business Environment
Database Systems Journal vol. IV, no. 4/2013 21 Data Mining Solutions for the Business Environment Ruxandra PETRE University of Economic Studies, Bucharest, Romania [email protected] Over
Towards a Thriving Data Economy: Open Data, Big Data, and Data Ecosystems
Towards a Thriving Data Economy: Open Data, Big Data, and Data Ecosystems Volker Markl [email protected] dima.tu-berlin.de dfki.de/web/research/iam/ bbdc.berlin Based on my 2014 Vision Paper On
Big Data Analytics. Chances and Challenges. Volker Markl
Volker Markl Professor and Chair Database Systems and Information Management (DIMA), Technische Universität Berlin www.dima.tu-berlin.de Big Data Analytics Chances and Challenges Volker Markl DIMA BDOD
ANALYTICS CENTER LEARNING PROGRAM
Overview of Curriculum ANALYTICS CENTER LEARNING PROGRAM The following courses are offered by Analytics Center as part of its learning program: Course Duration Prerequisites 1- Math and Theory 101 - Fundamentals
HiBench Installation. Sunil Raiyani, Jayam Modi
HiBench Installation Sunil Raiyani, Jayam Modi Last Updated: May 23, 2014 CONTENTS Contents 1 Introduction 1 2 Installation 1 3 HiBench Benchmarks[3] 1 3.1 Micro Benchmarks..............................
Advanced Big Data Analytics with R and Hadoop
REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional
Mike Maxey. Senior Director Product Marketing Greenplum A Division of EMC. Copyright 2011 EMC Corporation. All rights reserved.
Mike Maxey Senior Director Product Marketing Greenplum A Division of EMC 1 Greenplum Becomes the Foundation of EMC s Big Data Analytics (July 2010) E M C A C Q U I R E S G R E E N P L U M For three years,
What is Data Mining? Data Mining (Knowledge discovery in database) Data mining: Basic steps. Mining tasks. Classification: YES, NO
What is Data Mining? Data Mining (Knowledge discovery in database) Data Mining: "The non trivial extraction of implicit, previously unknown, and potentially useful information from data" William J Frawley,
Comparison of Distributed Data-Parallelization Patterns for Big Data Analysis: A Bioinformatics Case Study
Comparison of Distributed Data-Parallelization Patterns for Big Data Analysis: A Bioinformatics Case Study Jianwu Wang, Daniel Crawl, Ilkay Altintas San Diego Supercomputer Center University of California,
The basic data mining algorithms introduced may be enhanced in a number of ways.
DATA MINING TECHNOLOGIES AND IMPLEMENTATIONS The basic data mining algorithms introduced may be enhanced in a number of ways. Data mining algorithms have traditionally assumed data is memory resident,
SAP Solution Brief SAP HANA. Transform Your Future with Better Business Insight Using Predictive Analytics
SAP Brief SAP HANA Objectives Transform Your Future with Better Business Insight Using Predictive Analytics Dealing with the new reality Dealing with the new reality Organizations like yours can identify
In-Memory Analytics for Big Data
In-Memory Analytics for Big Data Game-changing technology for faster, better insights WHITE PAPER SAS White Paper Table of Contents Introduction: A New Breed of Analytics... 1 SAS In-Memory Overview...
Big Data. Lyle Ungar, University of Pennsylvania
Big Data Big data will become a key basis of competition, underpinning new waves of productivity growth, innovation, and consumer surplus. McKinsey Data Scientist: The Sexiest Job of the 21st Century -
Predictive Analytics Powered by SAP HANA. Cary Bourgeois Principal Solution Advisor Platform and Analytics
Predictive Analytics Powered by SAP HANA Cary Bourgeois Principal Solution Advisor Platform and Analytics Agenda Introduction to Predictive Analytics Key capabilities of SAP HANA for in-memory predictive
Some vendors have a big presence in a particular industry; some are geared toward data scientists, others toward business users.
Bonus Chapter Ten Major Predictive Analytics Vendors In This Chapter Angoss FICO IBM RapidMiner Revolution Analytics Salford Systems SAP SAS StatSoft, Inc. TIBCO This chapter highlights ten of the major
What s next for the Berkeley Data Analytics Stack?
What s next for the Berkeley Data Analytics Stack? Michael Franklin June 30th 2014 Spark Summit San Francisco UC BERKELEY AMPLab: Collaborative Big Data Research 60+ Students, Postdocs, Faculty and Staff
Big Data Open Source Stack vs. Traditional Stack for BI and Analytics
Big Data Open Source Stack vs. Traditional Stack for BI and Analytics Part I By Sam Poozhikala, Vice President Customer Solutions at StratApps Inc. 4/4/2014 You may contact Sam Poozhikala at [email protected].
Large-Scale Data Sets Clustering Based on MapReduce and Hadoop
Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE
Big Data and Data Science: Behind the Buzz Words
Big Data and Data Science: Behind the Buzz Words Peggy Brinkmann, FCAS, MAAA Actuary Milliman, Inc. April 1, 2014 Contents Big data: from hype to value Deconstructing data science Managing big data Analyzing
An In-Depth Look at In-Memory Predictive Analytics for Developers
September 9 11, 2013 Anaheim, California An In-Depth Look at In-Memory Predictive Analytics for Developers Philip Mugglestone SAP Learning Points Understand the SAP HANA Predictive Analysis library (PAL)
Apache Flink Next-gen data analysis. Kostas Tzoumas [email protected] @kostas_tzoumas
Apache Flink Next-gen data analysis Kostas Tzoumas [email protected] @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, [email protected] Assistant Professor, Information
Introduction to Big Data! with Apache Spark" UC#BERKELEY#
Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"
Chapter ML:XI. XI. Cluster Analysis
Chapter ML:XI XI. Cluster Analysis Data Mining Overview Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained Cluster
Data Mining. Knowledge Discovery, Data Warehousing and Machine Learning Final remarks. Lecturer: JERZY STEFANOWSKI
Data Mining Knowledge Discovery, Data Warehousing and Machine Learning Final remarks Lecturer: JERZY STEFANOWSKI Email: [email protected] Data Mining a step in A KDD Process Data mining:
PAXQuery: A Massively Parallel XQuery Processor
PAXQuery: A Massively Parallel XQuery Processor Jesús Camacho-Rodríguez, Dario Colazzo, Ioana Manolescu To cite this version: Jesús Camacho-Rodríguez, Dario Colazzo, Ioana Manolescu. PAXQuery: A Massively
Dynamic Data in terms of Data Mining Streams
International Journal of Computer Science and Software Engineering Volume 2, Number 1 (2015), pp. 1-6 International Research Publication House http://www.irphouse.com Dynamic Data in terms of Data Mining
Data Mining Analytics for Business Intelligence and Decision Support
Data Mining Analytics for Business Intelligence and Decision Support Chid Apte, T.J. Watson Research Center, IBM Research Division Knowledge Discovery and Data Mining (KDD) techniques are used for analyzing
Azure Data Lake Analytics
Azure Data Lake Analytics Compose and orchestrate data services at scale Fully managed service to support orchestration of data movement and processing Connect to relational or non-relational data
Data Science Certificate Program
Information Technologies Programs Data Science Certificate Program Accelerate Your Career extension.uci.edu/datascience Offered in partnership with University of California, Irvine Extension s professional
An Introduction to Data Mining
An Introduction to Intel Beijing [email protected] January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail
RevoScaleR Speed and Scalability
EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution
Role Description. Position of a Data Scientist Machine Learning at Fractal Analytics
Opportunity to work with leading analytics firm that creates Insights, Impact and Innovation. Role Description Position of a Data Scientist Machine Learning at Fractal Analytics March 2014 About the Company
How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning
How to use Big Data in Industry 4.0 implementations LAURI ILISON, PhD Head of Big Data and Machine Learning Big Data definition? Big Data is about structured vs unstructured data Big Data is about Volume
Hadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
Introduction to Data Mining
Introduction to Data Mining Jay Urbain Credits: Nazli Goharian & David Grossman @ IIT Outline Introduction Data Pre-processing Data Mining Algorithms Naïve Bayes Decision Tree Neural Network Association
An Overview of Knowledge Discovery Database and Data mining Techniques
An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,
Ali Ghodsi Head of PM and Engineering Databricks
Making Big Data Simple Ali Ghodsi Head of PM and Engineering Databricks Big Data is Hard: A Big Data Project Tasks Tasks Build a Hadoop cluster Challenges Clusters hard to setup and manage Build a data
SEIZE THE DATA. 2015 SEIZE THE DATA. 2015
1 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Deep dive into Haven Predictive Analytics Powered by HP Distributed R and
An interdisciplinary model for analytics education
An interdisciplinary model for analytics education Raffaella Settimi, PhD School of Computing, DePaul University Drew Conway s Data Science Venn Diagram http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
How Companies are! Using Spark
How Companies are! Using Spark And where the Edge in Big Data will be Matei Zaharia History Decreasing storage costs have led to an explosion of big data Commodity cluster software, like Hadoop, has made
Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic
Big Data Analytics with Spark and Oscar BAO Tamas Jambor, Lead Data Scientist at Massive Analytic About me Building a scalable Machine Learning platform at MA Worked in Big Data and Data Science in the
International Journal of Innovative Research in Computer and Communication Engineering
FP Tree Algorithm and Approaches in Big Data T.Rathika 1, J.Senthil Murugan 2 Assistant Professor, Department of CSE, SRM University, Ramapuram Campus, Chennai, Tamil Nadu,India 1 Assistant Professor,
EFFECTIVE USE OF THE KDD PROCESS AND DATA MINING FOR COMPUTER PERFORMANCE PROFESSIONALS
EFFECTIVE USE OF THE KDD PROCESS AND DATA MINING FOR COMPUTER PERFORMANCE PROFESSIONALS Susan P. Imberman Ph.D. College of Staten Island, City University of New York [email protected] Abstract
SAP Predictive Analytics: An Overview and Roadmap. Charles Gadalla, SAP @cgadalla SESSION CODE: 603
SAP Predictive Analytics: An Overview and Roadmap Charles Gadalla, SAP @cgadalla SESSION CODE: 603 Advanced Analytics SAP Vision Embed Smart Agile Analytics into Decision Processes to Deliver Business
Interactive data analytics drive insights
Big data Interactive data analytics drive insights Daniel Davis/Invodo/S&P. Screen images courtesy of Landmark Software and Services By Armando Acosta and Joey Jablonski The Apache Hadoop Big data has
A Case of Study on Hadoop Benchmark Behavior Modeling Using ALOJA-ML
www.bsc.es A Case of Study on Hadoop Benchmark Behavior Modeling Using ALOJA-ML Josep Ll. Berral, Nicolas Poggi, David Carrera Workshop on Big Data Benchmarks Toronto, Canada 2015 1 Context ALOJA: framework
ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
Radoop: Analyzing Big Data with RapidMiner and Hadoop
Radoop: Analyzing Big Data with RapidMiner and Hadoop Zoltán Prekopcsák, Gábor Makrai, Tamás Henk, Csaba Gáspár-Papanek Budapest University of Technology and Economics, Hungary Abstract Working with large
How To Get The Most Out Of Big Data
Volker Markl Professor and Chair Database Systems and Information Management (DIMA), Technische Universität Berlin www.dima.tu-berlin.de Big Data Analytics Chances and Challenges Volker Markl DIMA BDOD
DATA SCIENCE CURRICULUM WEEK 1 ONLINE PRE-WORK INSTALLING PACKAGES COMMAND LINE CODE EDITOR PYTHON STATISTICS PROJECT O5 PROJECT O3 PROJECT O2
DATA SCIENCE CURRICULUM Before class even begins, students start an at-home pre-work phase. When they convene in class, students spend the first eight weeks doing iterative, project-centered skill acquisition.
CSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 14.9-2015 1/36 Google MapReduce A scalable batch processing
Federated Cloud-based Big Data Platform in Telecommunications
Federated Cloud-based Big Data Platform in Telecommunications Chao Deng dengchao@chinamobilecom Yujian Du duyujian@chinamobilecom Ling Qian qianling@chinamobilecom Zhiguo Luo luozhiguo@chinamobilecom Meng
Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database
Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica
Data Mining in the Swamp
WHITE PAPER Page 1 of 8 Data Mining in the Swamp Taming Unruly Data with Cloud Computing By John Brothers Business Intelligence is all about making better decisions from the data you have. However, all
From Spark to Ignition:
From Spark to Ignition: Fueling Your Business on Real-Time Analytics Eric Frenkiel, MemSQL CEO June 29, 2015 San Francisco, CA What s in Store For This Presentation? 1. MemSQL: A real-time database for
BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics http://www.ccmb.med.umich.edu/node/1376
Course Director: Dr. Kayvan Najarian (DCM&B, [email protected]) Lectures: Labs: Mondays and Wednesdays 9:00 AM -10:30 AM Rm. 2065 Palmer Commons Bldg. Wednesdays 10:30 AM 11:30 AM (alternate weeks) Rm.
Demonstration of SAP Predictive Analysis 1.0, consumption from SAP BI clients and best practices
September 10-13, 2012 Orlando, Florida Demonstration of SAP Predictive Analysis 1.0, consumption from SAP BI clients and best practices Vishwanath Belur, Product Manager, SAP Predictive Analysis Learning
Big Data Analytics. An Introduction. Oliver Fuchsberger University of Paderborn 2014
Big Data Analytics An Introduction Oliver Fuchsberger University of Paderborn 2014 Table of Contents I. Introduction & Motivation What is Big Data Analytics? Why is it so important? II. Techniques & Solutions
Big Data and Analytics (Fall 2015)
Big Data and Analytics (Fall 2015) Core/Elective: MS CS Elective MS SPM Elective Instructor: Dr. Tariq MAHMOOD Credit Hours: 3 Pre-requisite: All Core CS Courses (Knowledge of Data Mining is a Plus) Every
Analytics on Big Data
Analytics on Big Data Riccardo Torlone Università Roma Tre Credits: Mohamed Eltabakh (WPI) Analytics The discovery and communication of meaningful patterns in data (Wikipedia) It relies on data analysis
Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment
Analysing Large Files in a Hadoop Distributed Cluster Environment S Saravanan, B Uma Maheswari Department of Computer Science and Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham,
Hybrid Software Architectures for Big Data. [email protected] @hurence http://www.hurence.com
Hybrid Software Architectures for Big Data [email protected] @hurence http://www.hurence.com Headquarters : Grenoble Pure player Expert level consulting Training R&D Big Data X-data hot-line
This Symposium brought to you by www.ttcus.com
This Symposium brought to you by www.ttcus.com Linkedin/Group: Technology Training Corporation @Techtrain Technology Training Corporation www.ttcus.com Big Data Analytics as a Service (BDAaaS) Big Data
From Raw Data to. Actionable Insights with. MATLAB Analytics. Learn more. Develop predictive models. 1Access and explore data
100 001 010 111 From Raw Data to 10011100 Actionable Insights with 00100111 MATLAB Analytics 01011100 11100001 1 Access and Explore Data For scientists the problem is not a lack of available but a deluge.
Sunnie Chung. Cleveland State University
Sunnie Chung Cleveland State University Data Scientist Big Data Processing Data Mining 2 INTERSECT of Computer Scientists and Statisticians with Knowledge of Data Mining AND Big data Processing Skills:
Navigating Big Data business analytics
mwd a d v i s o r s Navigating Big Data business analytics Helena Schwenk A special report prepared for Actuate May 2013 This report is the third in a series and focuses principally on explaining what
In-Database Analytics
Embedding Analytics in Decision Management Systems In-database analytics offer a powerful tool for embedding advanced analytics in a critical component of IT infrastructure. James Taylor CEO CONTENTS Introducing
SQL Server 2005 Features Comparison
Page 1 of 10 Quick Links Home Worldwide Search Microsoft.com for: Go : Home Product Information How to Buy Editions Learning Downloads Support Partners Technologies Solutions Community Previous Versions
Massive Cloud Auditing using Data Mining on Hadoop
Massive Cloud Auditing using Data Mining on Hadoop Prof. Sachin Shetty CyberBAT Team, AFRL/RIGD AFRL VFRP Tennessee State University Outline Massive Cloud Auditing Traffic Characterization Distributed
Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop
Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social
Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia
Monitis Project Proposals for AUA September 2014, Yerevan, Armenia Distributed Log Collecting and Analysing Platform Project Specifications Category: Big Data and NoSQL Software Requirements: Apache Hadoop
Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce
Analytics in the Cloud Peter Sirota, GM Elastic MapReduce Data-Driven Decision Making Data is the new raw material for any business on par with capital, people, and labor. What is Big Data? Terabytes of
