Data Scientist: From Mathematics to data management


1 Data Scientist: From Mathematics to data management Frederic Precioso 06/07/2015 Professor at University Nice Sophia Antipolis (UNS) Laboratory I3S Joint Research Unit from CNRS & UNS (UMR 7271) Team Scalable and Pervasive software and Knowledge Systems (SPARKS)

2 Outlook
1. Data scientist: the sexiest job of the 21st century?
2. Data scientist future profile
3. Data science in academic research
4. Future
These slides are partially based on "(Big) Data (Science) Skills" by Oscar Corcho.

3 Data Scientist: The Sexiest Job of the 21st Century? In October 2012, the Harvard Business Review published the article "Data Scientist: The Sexiest Job of the 21st Century" in its issue "Getting Control of Big Data". Since then, a lot of work has led to the conclusion that there is actually more than one data scientist profile.

4 Analyzing the Analyzers An Introspective Survey of Data Scientists and Their Work (June 2013) 4

5 Analyzing the Analyzers: An Introspective Survey of Data Scientists and Their Work (June 2013). Based on survey data from several hundred data science professionals, the authors applied data science algorithms and found that data scientists could be clustered into 4 subgroups, each with a different mix of skillsets: Data Businesspeople, Data Creatives, Data Developers, Data Researchers.

6 Analyzing the Analyzers An Introspective Survey of Data Scientists and Their Work (June 2013) ML = Machine Learning OR = Operations Research 6

7 Analyzing the Analyzers: An Introspective Survey of Data Scientists and Their Work (June 2013). Judging from their answers, data scientists see themselves as T-shaped experts.

8 More recently 8

9 Big Data Species 1. HPC and e-infrastructure Experts
Background: Computer Science (Systems), System Administration
Terms used in their native language: Blades, Infiniband, OpenMPI, racks, HDF, TBs, Gflops
Their daily life: check system logs; make sure that queues are active; install a new rack
What's Big Data for them? A commercial term for something that they have done for a long time. They really know how to configure and monitor a Hadoop cluster. They would love to see those who talk about Big Data run fluid dynamics processes.
[source Oscar Corcho]

10 Big Data Species 2. Data Storage and Access Experts
Background: Computer Science, Database administration
Terms used in their native language: SQL, NoSQL, column store, transactions, Hive, TBs/PBs, TPS (transactions per second)
Their daily life: optimize several queries; run a new benchmark; design an optimizer/physical operator
What's Big Data for them? A new opportunity to work on optimization algorithms. They know how to configure a database. They often laugh at those who deploy a NoSQL solution for a problem that can be solved with a relational database.
[source Oscar Corcho]

11 Big Data Species 3. Machine Learning Experts
Background: Mathematics, Statistics, Physics, Computer Science
Terms used in their native language: complexity, algorithm, p-value, convergence, precision, recall, ROC curves, Bayesian networks, R
Their daily life: read about a new problem; write down a few formulae on the whiteboard (or even the blackboard); prove that the algorithm terminates
What's Big Data for them? The same problems applied to data of larger size, with new challenges. Problems are not only solved in Hadoop or a powerful NoSQL DB. They are astonished by those who still mix up correlation and causality.
[source Oscar Corcho]

12 Big Data Species 4. Slow-data Experts
Background: Computer Science, Statistics, Library Sciences, Linguistics
Terms used in their native language: information model, vocabulary, ontology, data quality, curation
Their daily life: receive a database schema; talk to data producers and (re)users; obtain consensus and transform data
What's Big Data for them? The difficulty lies in the variety of data formats and structures. We may integrate data from varied sources, although this is not always possible. When you manage to integrate heterogeneous data, you can achieve better results.
[source Oscar Corcho]

13 Big Data Species 5. (Big Data) Consultants
Background: Computer Science, Economics
Terms used in their native language: business model, business opportunity, Big Data, Data Value Chain, Hadoop, Spark, R, TBs, GFlops
Their daily life: read a Gartner Big Data report; talk to potential customers; transfer needs to technicians
What's Big Data for them? It's the 4Vs, plus a few more. "I have a PPT presentation with a Big Data infrastructure, architecture, and previous projects, which I will use to sell a project to my customers."
[source Oscar Corcho]

14 BigData Ecosystem
Visualization: Dashboard (Kibana / Datameer); Maps (InstantAtlas, Leaflet, CartoDB); Charts (GoogleCharts, Charts.js); D3.js / Tableau / Flame
Analysis: Machine Learning (Scikit Learn, Mahout, Spark); Search / retrieval (Elastic Search, Solr)
Storage / Access / Exploitation: File System (HDFS, GGFS, Cassandra); Access (Hadoop / Spark / Both, Sqoop); Databases / Indexing (SQL / NoSQL / Both, MongoDB, HBase, Infinispan); Exploit (LogStash, Flume)
Infrastructures: Grid Computing / HPC; Cloud / Virtualization

15 Intermediate Conclusions
We all know that there are big opportunities in Big Data, but we need to be more productive. For that we need to:
Understand that simply using Hadoop, Spark or R does not necessarily mean we are doing Big Data, just as coding in Java does not necessarily mean we understand object-oriented programming
Understand that we have to interpret results adequately, from a scientific point of view
Understand the importance of homogenizing datasets, in order to facilitate their integration (slow-data)
Create real multidisciplinary teams
[source Oscar Corcho]

16 Outlook
1. Data scientist: the sexiest job of the 21st century?
2. Data scientist future profile
3. Data science in academic research
4. Future
These slides are partially based on "(Big) Data (Science) Skills" by Oscar Corcho.

17 Future Profile: multidisciplinary
Alex Szalay's T-shaped vs Pi-shaped
Drew Conway's Data Science Venn Diagram
Jim Gray's idea of the "Fourth Paradigm" of scientific discovery
Volker Markl: Data Scientist Jack of All Trades!

18 Future Profile: multidisciplinary
A recent report (in French) * leads to the same conclusion: "The consensus nowadays is to define the data scientist at the intersection of three areas of expertise: (i) Computer Science, (ii) Statistics and Mathematics, and (iii) Business knowledge. (...) Depending on the training program, one will most probably receive training with a major in either Computer Science, Statistics or Business knowledge."
* Serge Abiteboul, François Bancilhon, François Bourdoncle, Stephan Clemencon, Colin De La Higuera, et al. L'émergence d'une nouvelle filière de formation : « data scientists ». [Interne] INRIA Saclay <hal >

19 Outlook
1. Data scientist: the sexiest job of the 21st century?
2. Data scientist future profile
3. Data science in academic research
4. Future

20 BigData Academic Research Visualization R R Analysis R R R Storage / Access / Exploitation Infrastructures R R 20

21 What does BigData Academic Research mean?
Push the limits of existing approaches or design new ones, even if it is risky or (very) difficult
Demonstrate that contributions are theoretically sound
Compare with others by participating in challenges, or at least on BigData benchmarks
Complexity and scalability claims are always stronger when they can be proven

22 2 success stories of Machine learning among many
Classification: How to separate the data? Machine
$\mathrm{Error}_{Real}(\mathrm{Algorithm}) \le \mathrm{Error}_{Empirical}(\mathrm{Algorithm}) + \mathrm{Capacity}(\mathrm{Algorithm})$

23 2 success stories of Machine learning among many
Classification: How to separate the data?
$\mathrm{Error}_{Real}(\mathrm{Algorithm}) \le \mathrm{Error}_{Empirical}(\mathrm{Algorithm}) + \mathrm{Capacity}(\mathrm{Algorithm})$
Boosting Machine Random Forests

24 Ideas of boosting: Football Bets
If Varane and Sakho play together, the French football team wins.
If Ntep is not injured, the French football team wins.
If Benzema is substituted before the end, the French football team loses.
If Pogba is happy, the French football team wins.
From Antoine Cornuéjols' lecture slides

25 How to win? Ask professional gamblers
Let's assume:
That professional gamblers cannot provide one single decision rule that is both simple and relevant
But that, faced with several games, they can always provide decision rules a little bit better than random
Can we become rich?
From Antoine Cornuéjols' lecture slides

26 Idea
Ask the expert for heuristics
Gather a set of cases for which these heuristics fail (difficult cases)
Ask the expert again to provide heuristics for the difficult cases
And so on
Combine these heuristics
Here "expert" stands for "weak learner"
From Antoine Cornuéjols' lecture slides

27 Questions
How to choose games (i.e. learning examples) at each step? Focus on the most difficult games (examples), i.e. the ones on which the previous heuristics are the least relevant.
How to merge heuristics (decision rules) into one single decision rule? Take a weighted vote of all decision rules.
From Antoine Cornuéjols' lecture slides

28 Boosting
Boosting = a general method to convert several poor decision rules into one very powerful decision rule.
More precisely: given a weak learner which can always provide a decision rule that is (even just slightly) better than random, a boosting algorithm can (theoretically) build a global decision rule with an error rate as low as desired.
A theorem of Schapire on weak learning power proves that H gets a higher relevance than a global decision rule which would have been learnt directly on all training examples.
From Antoine Cornuéjols' lecture slides

29 Probabilistic boosting: AdaBoost
The standard algorithm is AdaBoost (Adaptive Boosting). 3 main ideas to generalize towards probabilistic boosting:
1. Use a set of specialized experts and ask them to vote to take a decision.
2. Weight the votes adaptively, by multiplicative update.
3. Modify the example distribution used to train each expert, iteratively increasing the weights of the examples misclassified at the previous iteration.
From Antoine Cornuéjols' lecture slides

30 AdaBoost: the algorithm
A training set: $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$, where $y_i \in \{-1, +1\}$ is the label (annotation) of example $x_i \in S$
A set of weak learners $\{h_t\}$
For $t = 0, \ldots, T$:
Give a weight to every sample in $\{1, \ldots, m\}$ according to how difficult it was for $h_{t-1}$ to classify it correctly: this defines $D_t$
Find the weak decision ("heuristic") $h_t : S \to \{-1, +1\}$ with the smallest error $\varepsilon_t$ on $D_t$: $\varepsilon_t = \Pr_{D_t}[h_t(x_i) \neq y_i] = \sum_{i \,:\, h_t(x_i) \neq y_i} D_t(i)$
Compute the influence/impact of $h_t$
Final decision: $H_{final}$ = a weighted majority vote of all the $h_t$
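The loop on this slide can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the author's code: the weak learners are assumed to be one-dimensional threshold stumps, the toy data are made up, and the influence of $h_t$ uses the standard AdaBoost choice $\alpha_t = \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}$ with the usual multiplicative weight update, neither of which is spelled out in the transcript.

```python
import math

def stump(x, thr, pol):
    """Weak decision: predicts pol if x > thr, otherwise -pol."""
    return pol if x > thr else -pol

def train_adaboost(xs, ys, rounds):
    """AdaBoost on 1-D inputs with labels in {-1, +1} (illustrative sketch)."""
    m = len(xs)
    weights = [1.0 / m] * m                      # initial distribution: uniform
    pts = sorted(set(xs))
    # Candidate thresholds: one below all points, then midpoints of neighbours.
    thresholds = [pts[0] - 1.0] + [(a + b) / 2.0 for a, b in zip(pts, pts[1:])]
    ensemble = []                                # list of (alpha_t, thr, pol)
    for _ in range(rounds):
        best = None
        for thr in thresholds:
            for pol in (+1, -1):
                # Weighted error epsilon_t of this stump under D_t.
                err = sum(w for x, y, w in zip(xs, ys, weights)
                          if stump(x, thr, pol) != y)
                if best is None or err < best[0]:
                    best = (err, thr, pol)
        err, thr, pol = best
        if err >= 0.5:
            break                                # no stump beats random on D_t
        err = max(err, 1e-12)                    # guard against log(1/0)
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, thr, pol))
        # Multiplicative update: misclassified examples gain weight.
        weights = [w * math.exp(-alpha * y * stump(x, thr, pol))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Final decision: sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump(x, thr, pol) for alpha, thr, pol in ensemble)
    return 1 if score >= 0 else -1

# Made-up data that no single stump can separate.
xs = [1, 2, 3, 4, 5, 6]
ys = [+1, +1, -1, -1, +1, +1]
model = train_adaboost(xs, ys, rounds=3)
print([predict(model, x) for x in xs])  # [1, 1, -1, -1, 1, 1]
```

On this toy set a single stump always misclassifies two points, while three boosted stumps classify all six correctly, which is the "several poor rules into one powerful rule" idea in miniature.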

31 Error of generalization for AdaBoost
The generalization error of H can be bounded by: $E_{Real}(H_T) = E_{Empirical}(H_T) + O\left(\sqrt{\frac{T \cdot d}{m}}\right)$
where T is the number of boosting iterations, m the number of training examples, and d the dimension of the $H_T$ space (weak learner complexity)
[Figure: error as a function of the number of iterations]

32 The Task of Face Detection Many slides adapted from P. Viola 32

33 Basic Idea Slide a window across image and evaluate a face model at every location. 33
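The sliding-window idea can be sketched as below. This is only an illustration: the image is assumed to be a list of rows of pixel values, and `model` stands for any hypothetical function that says whether a window looks like a face; the toy model used here is a placeholder, not a face detector.

```python
def sliding_windows(height, width, size, step):
    """Yield the (top, left) corner of every size x size window in the image."""
    for top in range(0, height - size + 1, step):
        for left in range(0, width - size + 1, step):
            yield top, left

def detect(image, size, step, model):
    """Evaluate the model at every window location; return accepted corners."""
    height, width = len(image), len(image[0])
    hits = []
    for top, left in sliding_windows(height, width, size, step):
        window = [row[left:left + size] for row in image[top:top + size]]
        if model(window):
            hits.append((top, left))
    return hits

# Toy example: the "model" accepts windows made only of bright pixels.
image = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
all_bright = lambda window: all(p == 1 for row in window for p in row)
print(detect(image, size=2, step=1, model=all_bright))  # [(1, 1)]
```

A real detector would also repeat the scan at several window sizes; that loop is left out to keep the sketch short.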

34 Image Features
Feature value = $\sum$(pixels in white area) $-$ $\sum$(pixels in black area)
$h_1(\cdot) = 1$ if feature value $< 29$, $0$ otherwise
$h_2(\cdot) = 1$ if feature value $< 26$, $0$ otherwise
$h_3(\cdot) = 1$ if feature value $> 11$, $0$ otherwise
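A rectangle feature of this kind and its thresholded weak classifier can be sketched as follows. The integral image used to get rectangle sums in constant time comes from the Viola and Jones paper rather than from this transcript, and the feature layout (white left half, black right half), the function names and the pixel values are assumptions made for illustration.

```python
def integral_image(img):
    """ii[r][c] = sum of img over rows < r and columns < c (one extra row/col)."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        row_sum = 0
        for c in range(w):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of the pixels in a rectangle, using four lookups."""
    return (ii[top + height][left + width] - ii[top][left + width]
            - ii[top + height][left] + ii[top][left])

def two_rect_feature(ii, top, left, height, width):
    """Sum over the white (left) half minus sum over the black (right) half."""
    half = width // 2
    white = rect_sum(ii, top, left, height, half)
    black = rect_sum(ii, top, left + half, height, half)
    return white - black

def weak_classifier(value, threshold, less_than=True):
    """Return 1 if the feature value is on the accepted side of the threshold."""
    if less_than:
        return 1 if value < threshold else 0
    return 1 if value > threshold else 0

img = [[5, 5, 1, 1],
       [5, 5, 1, 1]]
ii = integral_image(img)
value = two_rect_feature(ii, top=0, left=0, height=2, width=4)
print(value)                                          # 16
print(weak_classifier(value, 29))                     # 1, since 16 < 29
print(weak_classifier(value, 11, less_than=False))    # 1, since 16 > 11
```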

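35 AdaBoost Cascade Principle
[Figure: cascade of AdaBoost classifiers, stages 1, 2, ..., N. After stage 1, 99% of faces and 30% of non-faces are kept (70% of non-faces rejected). After stage 2, 98% of faces and 9% of non-faces are kept (21% of non-faces rejected). After stage N, 90% of faces are kept; the remaining percentage of non-faces is not legible.]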
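The arithmetic behind the cascade can be checked with a short sketch. It assumes that every stage keeps the same fraction of faces and of non-faces as the first stage in the figure (99% and 30%) and that stages act independently; the number of stages in the last call is a hypothetical choice, not a value taken from the slide.

```python
def cascade_rates(stage_detection, stage_false_positive, n_stages):
    """Overall fractions of faces and non-faces that survive n identical stages."""
    return stage_detection ** n_stages, stage_false_positive ** n_stages

def cascade_classify(window, stages):
    """Run the stages in order; reject as soon as one stage says 'non-face'."""
    for stage in stages:
        if not stage(window):
            return False      # early rejection: later stages are never evaluated
    return True

faces, non_faces = cascade_rates(0.99, 0.30, 2)
print(f"{faces:.4f} {non_faces:.4f}")    # 0.9801 0.0900

# Hypothetical depth of 10 stages, under the same per-stage assumption.
faces, non_faces = cascade_rates(0.99, 0.30, 10)
print(f"{faces:.3f} {non_faces:.2e}")    # 0.904 5.90e-06
```

The point of the design is that most windows are non-faces and are discarded by the first, cheapest stages, so the expensive later stages run on very few windows.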

36 The Implemented System
Training data: 5000 faces, all frontal, rescaled to 24x24 pixels; 300 million non-face sub-windows from 9500 non-face images
Faces are normalized: scale, translation
Many variations: across individuals, illumination, pose

37 Results Fixed images Video sequence Frontal face Left profile face Right profile face 37

38 Extension Fast and robust Other descriptors Other cascades (rotation ) Eye detection, Hand detection, Body detection 38

39 2 success stories of Machine learning among many
Classification: How to separate the data?
$\mathrm{Error}_{Real}(\mathrm{Algorithm}) \le \mathrm{Error}_{Empirical}(\mathrm{Algorithm}) + \mathrm{Capacity}(\mathrm{Algorithm})$
Boosting Machine Random Forests

40 Decision tree to decide whether to play tennis or not
Objective: 2 classes (yes and no); predict whether a game will be played or not
Temperature can easily be converted into a numerical value
I.H. Witten and E. Frank, Data Mining, Morgan Kaufmann Pub.,

41 Decision tree to decide playing tennis or not Class: NO Class:YES Class: YES 41

42 Final decision tree 42

43 Decision trees do not converge? Make a forest 43
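The "make a forest" idea can be illustrated with a bagged ensemble of one-split trees (stumps): each tree is trained on a bootstrap resample and the forest takes a majority vote. This is a simplified sketch, not a full random forest (which would also pick random feature subsets at each split), and the temperature values and labels below are made up rather than taken from the Witten and Frank dataset.

```python
import random
from collections import Counter

def train_stump(sample):
    """Fit a one-split tree on (x, label) pairs.

    Returns (threshold, left_label, right_label); threshold is None when
    no split beats predicting the majority label everywhere.
    """
    labels = [lab for _, lab in sample]
    majority = Counter(labels).most_common(1)[0][0]
    best = (None, majority, majority, sum(lab != majority for lab in labels))
    values = sorted({x for x, _ in sample})
    for a, b in zip(values, values[1:]):
        thr = (a + b) / 2.0
        left = Counter(lab for x, lab in sample if x <= thr).most_common(1)[0][0]
        right = Counter(lab for x, lab in sample if x > thr).most_common(1)[0][0]
        errors = sum(lab != (left if x <= thr else right) for x, lab in sample)
        if errors < best[3]:
            best = (thr, left, right, errors)
    return best[:3]

def stump_predict(stump, x):
    thr, left, right = stump
    if thr is None:
        return left
    return left if x <= thr else right

def train_forest(data, n_trees, seed=0):
    """Train each stump on a bootstrap resample (drawn with replacement)."""
    rng = random.Random(seed)
    return [train_stump([rng.choice(data) for _ in data]) for _ in range(n_trees)]

def forest_predict(forest, x):
    """Majority vote over all trees."""
    return Counter(stump_predict(s, x) for s in forest).most_common(1)[0][0]

# Made-up (temperature, play?) examples.
data = [(1, "no"), (2, "no"), (3, "no"), (4, "no"),
        (11, "yes"), (12, "yes"), (13, "yes"), (14, "yes")]
forest = train_forest(data, n_trees=25)
print(forest_predict(forest, 2), forest_predict(forest, 13))
```

Individual trees vary with the resample they saw, but the vote averages that variation out, which is why a forest is more stable than any single tree.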

44 Error of generalization for Random Forest
The generalization error of RF can be bounded by: $E_{Real}(RF) \le \frac{\rho\,(1 - s^2)}{s^2}$
where ρ is the mean correlation between two decision trees and s is the quality of prediction of the set of decision trees
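A quick numerical reading of this bound, assuming the form $\rho(1 - s^2)/s^2$ (Breiman's bound, where s is called the strength of the trees). The values of ρ and s below are arbitrary illustrations, not measurements.

```python
def rf_bound(rho, s):
    """Upper bound rho * (1 - s^2) / s^2 on the forest's generalization error."""
    if not 0 < s <= 1:
        raise ValueError("s must be in (0, 1]")
    return rho * (1 - s ** 2) / s ** 2

# Same tree quality, decreasing correlation between trees.
for rho in (0.8, 0.4, 0.2):
    print(rho, round(rf_bound(rho, s=0.8), 4))
```

The bound tightens when trees are individually better (larger s) or less correlated (smaller ρ), which is the motivation for injecting randomness into each tree.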

45 Success story: Kinect
From "Real-Time Human Pose Recognition in Parts from a Single Depth Image", Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark Finocchio, Richard Moore, Alex Kipman, and Andrew Blake, CVPR, June 2011.

46 Success story: Kinect 46

47 Other success stories
Support Vector Machines: $E_{Real}(SVM) \le E_{Empirical}(SVM) + \sqrt{\frac{d\left(\ln\frac{2m}{d} + 1\right) - \ln\frac{\alpha}{4}}{m}}$
m: the number of training examples; d: the dimension of decision space
Bound valid with probability $1 - \alpha$
Artificial Neural Network and Deep Learning
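To see how this bound behaves, the capacity term can be evaluated numerically. The sketch assumes the classical Vapnik form $\sqrt{\left(d(\ln(2m/d) + 1) - \ln(\alpha/4)\right)/m}$; the empirical error, d, m and α used below are illustrative values only.

```python
import math

def vc_bound(empirical_error, d, m, alpha):
    """Empirical error plus the capacity term; holds with probability 1 - alpha."""
    capacity = math.sqrt((d * (math.log(2 * m / d) + 1) - math.log(alpha / 4)) / m)
    return empirical_error + capacity

# More training examples shrink the capacity term for a fixed d.
for m in (1_000, 10_000, 100_000):
    print(m, round(vc_bound(0.05, d=10, m=m, alpha=0.05), 3))
```

The gap between empirical and real error closes as m grows and widens as d grows, which is the same empirical-error-plus-capacity pattern as in the boosting and random forest bounds.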

48 Outlook
1. Data scientist: the sexiest job of the 21st century?
2. Data scientist future profile
3. Data science in academic research
4. Future

49 Future trainees
Before applying a method or a technology, make sure that its original conditions of validity hold
When a method is extended outside its domain of validity, try to prove the mathematical consistency / stability of the new method
Demonstrate, or at least provide insights into, its complexity and scalability
In the next few years, new students will graduate with a more global vision of data science challenges, a deep understanding of the layers involved and a better knowledge of powerful techniques.

50 I will be glad to answer any questions. Frederic Precioso


More information

POSTGRAD PLACEMENTS. Placements are an integral part of the Masters programmes, so international students will not require additional work visas.

POSTGRAD PLACEMENTS. Placements are an integral part of the Masters programmes, so international students will not require additional work visas. POSTGRAD PLACEMENTS COMPUTATIONAL FINANCE DATA SCIENCE AND ANALYTICS MACHINE LEARNING KEY INFORMATION Placements can start in the middle of June 2015 or later and must finish by the middle of June 2016

More information

Certificate Program in Applied Big Data Analytics in Dubai. A Collaborative Program offered by INSOFE and Synergy-BI

Certificate Program in Applied Big Data Analytics in Dubai. A Collaborative Program offered by INSOFE and Synergy-BI Certificate Program in Applied Big Data Analytics in Dubai A Collaborative Program offered by INSOFE and Synergy-BI Program Overview Today s manager needs to be extremely data savvy. They need to work

More information

Moving From Hadoop to Spark

Moving From Hadoop to Spark + Moving From Hadoop to Spark Sujee Maniyam Founder / Principal @ www.elephantscale.com sujee@elephantscale.com Bay Area ACM meetup (2015-02-23) + HI, Featured in Hadoop Weekly #109 + About Me : Sujee

More information

Chapter 6. The stacking ensemble approach

Chapter 6. The stacking ensemble approach 82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

More information

What is Data Science? Data, Databases, and the Extraction of Knowledge Renée T., @becomingdatasci, November 2014

What is Data Science? Data, Databases, and the Extraction of Knowledge Renée T., @becomingdatasci, November 2014 What is Data Science? { Data, Databases, and the Extraction of Knowledge Renée T., @becomingdatasci, November 2014 Let s start with: What is Data? http://upload.wikimedia.org/wikipedia/commons/f/f0/darpa

More information

Big Data Explained. An introduction to Big Data Science.

Big Data Explained. An introduction to Big Data Science. Big Data Explained An introduction to Big Data Science. 1 Presentation Agenda What is Big Data Why learn Big Data Who is it for How to start learning Big Data When to learn it Objective and Benefits of

More information

Promises and Pitfalls of Big-Data-Predictive Analytics: Best Practices and Trends

Promises and Pitfalls of Big-Data-Predictive Analytics: Best Practices and Trends Promises and Pitfalls of Big-Data-Predictive Analytics: Best Practices and Trends Spring 2015 Thomas Hill, Ph.D. VP Analytic Solutions Dell Statistica Overview and Agenda Dell Software overview Dell in

More information

QUICK FACTS. Delivering a Unified Data Architecture for Sony Computer Entertainment America TEKSYSTEMS GLOBAL SERVICES CUSTOMER SUCCESS STORIES

QUICK FACTS. Delivering a Unified Data Architecture for Sony Computer Entertainment America TEKSYSTEMS GLOBAL SERVICES CUSTOMER SUCCESS STORIES [ Consumer goods, Data Services ] TEKSYSTEMS GLOBAL SERVICES CUSTOMER SUCCESS STORIES QUICK FACTS Objectives Develop a unified data architecture for capturing Sony Computer Entertainment America s (SCEA)

More information

How To Handle Big Data With A Data Scientist

How To Handle Big Data With A Data Scientist III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution

More information

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84 Index A Amazon Web Services (AWS), 50, 58 Analytics engine, 21 22 Apache Kafka, 38, 131 Apache S4, 38, 131 Apache Sqoop, 37, 131 Appliance pattern, 104 105 Application architecture, big data analytics

More information

An interdisciplinary model for analytics education

An interdisciplinary model for analytics education An interdisciplinary model for analytics education Raffaella Settimi, PhD School of Computing, DePaul University Drew Conway s Data Science Venn Diagram http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

More information

Data Analytics and Business Intelligence (8696/8697)

Data Analytics and Business Intelligence (8696/8697) http: // togaware. com Copyright 2014, Graham.Williams@togaware.com 1/36 Data Analytics and Business Intelligence (8696/8697) Ensemble Decision Trees Graham.Williams@togaware.com Data Scientist Australian

More information

Tensor Methods for Machine Learning, Computer Vision, and Computer Graphics

Tensor Methods for Machine Learning, Computer Vision, and Computer Graphics Tensor Methods for Machine Learning, Computer Vision, and Computer Graphics Part I: Factorizations and Statistical Modeling/Inference Amnon Shashua School of Computer Science & Eng. The Hebrew University

More information

COMP9321 Web Application Engineering

COMP9321 Web Application Engineering COMP9321 Web Application Engineering Semester 2, 2015 Dr. Amin Beheshti Service Oriented Computing Group, CSE, UNSW Australia Week 11 (Part II) http://webapps.cse.unsw.edu.au/webcms2/course/index.php?cid=2411

More information

BIG DATA What it is and how to use?

BIG DATA What it is and how to use? BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14

More information

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect Big Data & QlikView Democratizing Big Data Analytics David Freriks Principal Solution Architect TDWI Vancouver Agenda What really is Big Data? How do we separate hype from reality? How does that relate

More information

Challenges for Data Driven Systems

Challenges for Data Driven Systems Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2

More information

The Future of Data Management

The Future of Data Management The Future of Data Management with Hadoop and the Enterprise Data Hub Amr Awadallah (@awadallah) Cofounder and CTO Cloudera Snapshot Founded 2008, by former employees of Employees Today ~ 800 World Class

More information

BIG DATA Alignment of Supply & Demand Nuria de Lama Representative of Atos Research &

BIG DATA Alignment of Supply & Demand Nuria de Lama Representative of Atos Research & BIG DATA Alignment of Supply & Demand Nuria de Lama Representative of Atos Research & Innovation 04-08-2011 to the EC 8 th February, Luxembourg Your Atos business Research technologists. and Innovation

More information

Hadoop Ecosystem B Y R A H I M A.

Hadoop Ecosystem B Y R A H I M A. Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open

More information

Data Science at U of U

Data Science at U of U Data Science at U of U Je M. Phillips Assistant Professor, School of Computing Center for Extreme Data Management, Analysis, and Visualization Director, Data Management and Analysis Track University of

More information

Leveraging Big Data Technologies to Support Research in Unstructured Data Analytics

Leveraging Big Data Technologies to Support Research in Unstructured Data Analytics Leveraging Big Data Technologies to Support Research in Unstructured Data Analytics BY FRANÇOYS LABONTÉ GENERAL MANAGER JUNE 16, 2015 Principal partenaire financier WWW.CRIM.CA ABOUT CRIM Applied research

More information

Transforming the Telecoms Business using Big Data and Analytics

Transforming the Telecoms Business using Big Data and Analytics Transforming the Telecoms Business using Big Data and Analytics Event: ICT Forum for HR Professionals Venue: Meikles Hotel, Harare, Zimbabwe Date: 19 th 21 st August 2015 AFRALTI 1 Objectives Describe

More information

Spark: Cluster Computing with Working Sets

Spark: Cluster Computing with Working Sets Spark: Cluster Computing with Working Sets Outline Why? Mesos Resilient Distributed Dataset Spark & Scala Examples Uses Why? MapReduce deficiencies: Standard Dataflows are Acyclic Prevents Iterative Jobs

More information

Monday Morning Data Mining

Monday Morning Data Mining Monday Morning Data Mining Tim Ruhe Statistische Methoden der Datenanalyse Outline: - data mining - IceCube - Data mining in IceCube Computer Scientists are different... Fakultät Physik Fakultät Physik

More information

Big Data Analytics Platform @ Nokia

Big Data Analytics Platform @ Nokia Big Data Analytics Platform @ Nokia 1 Selecting the Right Tool for the Right Workload Yekesa Kosuru Nokia Location & Commerce Strata + Hadoop World NY - Oct 25, 2012 Agenda Big Data Analytics Platform

More information

Model Combination. 24 Novembre 2009

Model Combination. 24 Novembre 2009 Model Combination 24 Novembre 2009 Datamining 1 2009-2010 Plan 1 Principles of model combination 2 Resampling methods Bagging Random Forests Boosting 3 Hybrid methods Stacking Generic algorithm for mulistrategy

More information

Upcoming Announcements

Upcoming Announcements Enterprise Hadoop Enterprise Hadoop Jeff Markham Technical Director, APAC jmarkham@hortonworks.com Page 1 Upcoming Announcements April 2 Hortonworks Platform 2.1 A continued focus on innovation within

More information

Oracle Big Data SQL Technical Update

Oracle Big Data SQL Technical Update Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical

More information

Analysis of Big Data Survey 2015 on Skills, Training and Capacity Building

Analysis of Big Data Survey 2015 on Skills, Training and Capacity Building Analysis of Big Data Survey 2015 on Skills, Training and Capacity Building D R A F T Version 1.0 12 Oct 2015 By UN Global Working Group on Big Data for Official Statistics Task Team on Skills, Training

More information

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform

More information

Human Pose Estimation from RGB Input Using Synthetic Training Data

Human Pose Estimation from RGB Input Using Synthetic Training Data Human Pose Estimation from RGB Input Using Synthetic Training Data Oscar Danielsson and Omid Aghazadeh School of Computer Science and Communication KTH, Stockholm, Sweden {osda02, omida}@kth.se arxiv:1405.1213v2

More information

Training for Big Data

Training for Big Data Training for Big Data Learnings from the CATS Workshop Raghu Ramakrishnan Technical Fellow, Microsoft Head, Big Data Engineering Head, Cloud Information Services Lab Store any kind of data What is Big

More information

The Scientific Data Mining Process

The Scientific Data Mining Process Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In

More information

MACHINE LEARNING BASICS WITH R

MACHINE LEARNING BASICS WITH R MACHINE LEARNING [Hands-on Introduction of Supervised Machine Learning Methods] DURATION 2 DAY The field of machine learning is concerned with the question of how to construct computer programs that automatically

More information

High Productivity Data Processing Analytics Methods with Applications

High Productivity Data Processing Analytics Methods with Applications High Productivity Data Processing Analytics Methods with Applications Dr. Ing. Morris Riedel et al. Adjunct Associate Professor School of Engineering and Natural Sciences, University of Iceland Research

More information

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction COMP3420: Advanced Databases and Data Mining Classification and prediction: Introduction and Decision Tree Induction Lecture outline Classification versus prediction Classification A two step process Supervised

More information

Sense Making in an IOT World: Sensor Data Analysis with Deep Learning

Sense Making in an IOT World: Sensor Data Analysis with Deep Learning Sense Making in an IOT World: Sensor Data Analysis with Deep Learning Natalia Vassilieva, PhD Senior Research Manager GTC 2016 Deep learning proof points as of today Vision Speech Text Other Search & information

More information

Massive Labeled Solar Image Data Benchmarks for Automated Feature Recognition

Massive Labeled Solar Image Data Benchmarks for Automated Feature Recognition Massive Labeled Solar Image Data Benchmarks for Automated Feature Recognition Michael A. Schuh1, Rafal A. Angryk2 1 Montana State University, Bozeman, MT 2 Georgia State University, Atlanta, GA Introduction

More information

Industry 4.0 and Big Data

Industry 4.0 and Big Data Industry 4.0 and Big Data Marek Obitko, mobitko@ra.rockwell.com Senior Research Engineer 03/25/2015 PUBLIC PUBLIC - 5058-CO900H 2 Background Joint work with Czech Institute of Informatics, Robotics and

More information

Big Data Analytics and Optimization

Big Data Analytics and Optimization Big Data Analytics and Optimization C e r t i f i c a t e P r o g r a m i n E n g i n e e r i n g E x c e l l e n c e C e r t i f i c a t e P r o g r a m s i n A c c e l e r a t e d E n g i n e e r i n

More information

locuz.com Big Data Services

locuz.com Big Data Services locuz.com Big Data Services Big Data At Locuz, we help the enterprise move from being a data-limited to a data-driven one, thereby enabling smarter, faster decisions that result in better business outcome.

More information