Big Data Analytics in Astrophysics. Katharina Morik, Informatik 8, TU Dortmund

Size: px
Start display at page:

Download "Big Data Analytics in Astrophysics. Katharina Morik, Informatik 8, TU Dortmund"

Transcription

1 Big Data Analytics in Astrophysics Katharina Morik, Informatik 8, TU Dortmund

2 Overview Short introduction to the Collaborative Research Center SFB 876 Zooming into one of its projects: C3 Astrophysical Data Analysis IceCube Magic, FACT Tools for data analysis, streaming data Science today is based on data. Data analysis is intrinsically tied to the scientific process. Active Galactic Nuclei

3 Collaborative Research Center SFB 876: Providing Information by Resource-Constrained Data Analysis Already granted:

4 SFB 876: Cyberphysical Systems Ubiquitous systems Smart phones Multimedia Navigation Support of elder people Sensor measurements Factories Medicine Science Goals: Analysis and prediction available: Anytime, Everywhere!

5 SFB 876: Big Data Analytics Statistical analysis: Carefully collected data Sophisticated analysis. Data mining: Large data sets gathered for other purposes than analysis Simple, reliable models. Big data analytics: Large volume, velocity, variety data New storage and processing architectures. Goals: Scalable algorithms, Algorithms for new architectures

6 SFB 876: Resource-aware algorithms Cyberphysical systems produce big data. Big data analytics delivers data summaries, predictions. Foundations Beyond runtime and sample complexity! Future: memory- and energyefficient analytics. Collaborative research center SFB 876 with 13 projects, 20 professors and about 50 Ph D students.

7 Christian Sohler, theoretical CS Sangkyun Lee, Data Analysis CS Olaf Spinczyk, Embedded Systems CS Alexander Schramm, Biomedicine Christian Wietfeld, Electrical Engin. Michael Schreckenberg, Traffic Physics Jochen Deuse, Machining Jian-Jia Chen, Embedded Systems CS Kristian Kersting, Data Analysis CS Katharina Morik, Data Analysis CS Heinrich Müller, Visual Analytics CS Petra Mutzel, Algorithm Engin. CS Jörg Rahnenführer, Statistics Peter Marwedel, Embedded Systems CS Katja Ickstadt, Statistics Tim Ruhe, Astrophysics Wolfgang Rhode, Astrophysics Bernhard Spaan, Particle physics Jens Teubner, Databases CS Michael Ten Hompel, Logistics Project leaders

8 Overview Short introduction to the Collaborative Research Center SFB 876 Zooming into one of its projects: C3 Astrophysical Data Analysis IceCube Magic, FACT Tools for data analysis, streaming data Science today is based on data. Data analysis is intrinsically tied to the scientific process. Active Galactic Nuclei

9 Big Data Analytics in Astrophysics High Energy Astrophysical Phenomena Neutrinos Gamma rays Instrumentation and Methods for Astrophysics Telescopes Data Analysis IceCube Collaboration Magic I, II, FACT Project C3 of SFB 876 Prof. Dr. Dr. Wolfgang Rhode Prof. Dr. Katharina Morik Dr. Tim Ruhe Big Data is Volume, Velocity, Variety. Big Data helps if rare events are outliers. Here, we are looking for the needle in the haystack: neutrinos and gamma rays are dominated by other particles. Data analysis tool RapidMiner Feature selection MRMR Streams framework

10 Structure of the remaining talk the data analysis cycle Data Analysis is not just one step in the overall scientific investigation. Data Analysis accompanies the scientific investigation. Data Analytics is interdisciplinary in nature: Physics Statistics Data management Machine learning Software/Algorithm engineering Complexity Theory Insight Physics questions

11 Data Analysis Tools - Requirements Coverage: Support the overall cycle Amenability: Easy to use for all involved scientists Rapidity: Fast creation of analysis process Comparability: international scientific terminology Reproducibility: storing the analysis process with all parameters in a small format. Not just the modeling step. Not just for statisticians, computer scientists, physicists but for all of them. Not a bunch of programs/libraries to be used in programming. Not a new name with a little variance for known methods that have a theoretic basis. Not asking the programmer whether he remembers which parameters succeeded.

12 RapidMiner Coverage: Support the overall cycle Amenability: Easy to use for all involved scientists Rapidity: Fast creation of analysis process Comparability: international scientific terminology Reproducibility: storing the analysis process with all parameters in a small format, re-run, exchange. If 115 data transformations and 264 modeling operators are not enough: Write your own operator and plug it into RapidMiner!

13 Physics Questions Explaining Birth and Death of Stars What happens when particles collide and how, exactly? High Energy Astrophysical Phenomena Neutrinos Gamma rays Instrumentation and Methods for Astrophysics Telescopes Data Analysis Insight Physics questions

14 Death of a star 1054 Chinese astronomers detect a supernova. Its remnant is the crab nebula Jocelyn Bell, Antony Hewish discover a pulsar, which comes from a rotating neutron star. The heat of the dying star has turned electrons and protons to neutrons. Neutron stars are composed of neutrons. This releases a flux of neutrinos. Neutron stars may produce gamma-rays.

15 Neutrino genesis b - - decay: n p + e - + n e Neutron becomes Proton Electron, Anti-Electron- Neutrino b + - decay: p n + e + + n e Proton becomes Neutron Positron, Electron-Neutrino Neutrinos accompanying Electron Mu-meson (myon or muon) Tauon Myon neutrino collides with proton of hydrogen resulting in a positive pion and a negative myon.

16 Physics Questions Explaining Birth and Death of Stars What happens when particles collide and how, exactly? High Energy Astrophysical Phenomena Neutrinos Gamma rays Instrumentation and Methods for Astrophysics Telescopes Data Analysis Insight Physics questions

17 Faster than light Cherenkov radiation has been observed by Marie Curie and interpreted by Pawel Cherenkov. If a loaded particle in a medium moves faster than light, the blue Cherenkov light shows up. The medium can be: Water Ice Air. A "faster than light" neutrino discovery was actually the result of a loose cable. So the scientific community can rest easy. CERN sent a beam of neutrinos to the Gran Sasso laboratory, OPERA.

18 Breakthrough of the year 2013: IceCube The IceCube project has been awarded the 2013 Breakthrough of the Year by the British magazine Physics World. Detection of 28 neutrinos from outside the Milky Way 2 neutrinos with 1 PeV (Ernie and Bert), one even 2 PeV (Big Bird). Model the distribution of all neutrinos (frequency of neutrinos at all energies)! Separate neutrinos from other particles. Interaction in detector cosmic neutrinos IceCube collaboration: 36 research institutes in 8 countries, Francis Halzen, University of Wisconsin-Madison, Olga Botner, University of Uppsala (speakers) Atmospheric neutrinos IceCube Magic Interaction in orbit proton

19 Magic and FACT Cherenkov Telescopes Air shower is visible in the camera for about 150 Nanoseconds pixels of the camera sampled at 2 GHz rate. Hardware filter decides about storage. Records of sampled voltages of camera pixels. Model the distribution of gamma rays Separate gamma from other rays

20 Data understanding What do the data look like? How to access the data? Do we need real-time access? Physics questions Data understanding Insight

21 Data volume challenging data analysis IceCube at the South Pole: within the holes, there are Digital Optical Modules (DOM) mounted, that record Cherenkov light. The trace of muons through the 3D array of DOMs shows also the direction of incoming neutrinos. 1 Terabyte per day = 1 Mio. Gigabyte A satellite takes about 10 years to transport the data of a year to the University Wisconsin. A ship transports the data on hard discs in 28 days. No real-time analysis. Ships are faster than satellites for data transmission.

22 Skewed distribution challenging data analysis Calibration, cleaning Feature extraction Signal separation Energy estimation A simulator provides labeled observations. Gamma rays of high energy are rare events as opposed to hadrons, ratio 1 to MAGIC I (2003) and MAGIC II (2009) La Palma, Roque de los Muchachos FACT telescope, same type, same place

23 FACT-Viewer

24 Streams framework for real-time processing Streams framework offers highlevel view of streaming data (Bockermann,Blom 2012). There is a plugin for RapidMiner. It incorporates MOA for the analysis of streaming data. It compiles into Storm for execution, if wanted. The Fact-tools are a set of Java classes that provide an extension to the streams framework for reading and processing data from the FACT telescope.

25 Data understanding what is to be learned? IceCube: Overall spectrum of atmospheric neutrinos from 100 Giga ev to 1 Peta ev demands better background rejection. MAGIC, FACT: Better separation of hadrons and gamma rays. Reproducible results. Insight Physics questions Data understanding Tim Ruhe, Katharina Morik, Wolfgang Rhode Application of RapidMiner in Neutrino Astronomy in: Hofmann, Klinkenberg (eds) RapidMiner Data Mining Use Cases and Business Analytics Applications, CRC Press 2014 Mark Aartsen Katharina Morik, Wolfgang Rhode, Tim Ruhe Development of a general Analysis and Unfolding Scheme and its Application to Measure the Energy Spectrum of Atmospheric Neutrinos with IceCube in: Eur.Phys.Journal 2014

26 Data preparation training samples and features Samples Balanced stratified sample enhances learning. Training: signal neutrino events Labeling by simulation. IceCube 59 strings measurements of 346 days raw events after manual cuts No true labels given, but estimate according to distribution of neutrinos. Insight Physics questions Data preparation

27 Data preparation Feature selection Too many features hide the true pattern and slow down modeling. Redundant features may lead to wrong results. Wrong: Two features with the same meaning two times an impact. Right: Each feature should have half an impact. Quality of a feature set is given by the performance of learning using the feature set. Insight Physics questions Data preparation

28 Q -- MRMR Minimum Redundancy Maximum Relevance (MRMR) Start with empty feature set. Add the one with lowest redundancy to already chosen features D(x,x) and highest relevance w.r.t. label R(x,y) Efficient implementation by Benjamin Schowe in RapidMiner s Feature Selection Extension 1 Q R( x, y) D( x, x) j x in F j Mark Hall Correlation Based Feature Selection 1999

29 Feature selection in RapidMiner 2

30 Overfitting to a sample? Measuring stability! Split the training data into m sets. Do feature selection on m- 1 sets. Use the selected features for learning and test the learner s performance. We do that m times. One sample leads to the feature set A, another to the feature set B, both select k features out of n given ones. Jaccard Index J A B A B

31 Feature transformation and extraction Right representation eases understanding and modeling, e.g., time series: deterministic, with outlier, level change y t y t time t y t+1 y t HR t timet y t+1 y t U.Gather, M. Bauer time t y t

32 Feature selection and extraction in IceCube 200 features given, diverse constructions over raw data. Feature selection using MRMR particularly successful: Redundant features in the data, e.g., the zenith angle or its consequence is obtained from several reconstruction algorithms. 25 features selected with stability over 0.8. Creating features by binning (spectrum unfolding) and learning Tim Ruhe s method. Novel spectrum unfolding: Partition attribute values into bins (intervals) Intervals become labled classes Run multiclass learning output confidence of each class for each example ex 1 ex n bin 1 bin r Neutrino? conf Learn classifier from these data.

33 Binning in RapidMiner Numerical data Discretized into 4 bins Discretized to equal frequency bins

34 Modeling tasks classification and regression Given observations x with labels y {(x 1,y 1 ), (x n,y n )} with binominal y (classification) with real-valued y (regression) Find f(x)=y such that the error is minimized. RapidMiner offers 167 learning algorithms for classification and regression. IceCube: y in {neutrino, not neutrino} Algorithm should be robust, scalable, parallel! Insight Physics questions Modeling

35 Types of Models Lazy Modeling, Local Models K-NeirestNeighbors Additive Models Decision Trees Linear Models Linear Regression Support Vector Machine Bayesian Models

36 Ensemble Methods here: for decision trees Ensemble: Take many models and decide according to the majority vote! Algorithm Training: For l decision trees (parallel): (1) Take a sample from the data (2) Take a subset of features (3) Choose the best feature according to minimal entropy Split according to feature If purity not ok, goto (3). Weka Random Forest in RapidMiner parallel Breiman 2001, Machine Learning Journal

37 Evaluation Training and testing on different samples is necessary in order to estimate the true error. Testing on the same set would overestimate the quality of the model. Best test: leave one out requires for n observations n runs of modeling. This is not efficient. Crossvalidation: split into m partitions, use m-1 for training and test on the unused one. Output the average of all tests. Fair. Insight Physics questions Evaluation

38 Cross validation in RapidMiner Your measurement is meaningless without knowledge of the error. Walter Levin 1 2 2

39 IceCube Results We enhanced the neutrino recognition by 62% (IceCube collaboration and Morik 2014). Quality cuts lead from full data D to D, rejecting the easy 91.4% of background. Random Forest leads from D to D so that % of background muons are rejected. At this background rejection atmospheric neutrino events were detected in 346 days of IceCube 59. Insight Physics questions

40 IceCube Results For the first time, the spectrum of atmospheric neutrinos could be reconstructed up to an energy level 1 PeV. Now, theoretical astrophysics can build upon this. Physics questions Insight

41 Empirical work for theory development En [GeV cm F n s sr -1 ] log (E [GeV]) 10 n IC-59 n m Unfolding AMANDA n m unfolding IC-79 n e flux Gaisser H3a Gaisser H3a No Charm Frejus n m Frejus n e Frejus n m model Frejus n e model Honda n e Insight Physics questions Evaluation

42 Interdisciplinary work

43 Big Data Analytics in Astrophysics High Energy Astrophysical Phenomena Neutrinos Gamma rays Instrumentation and Methods for Astrophysics Telescopes Data Analysis IceCube Collaboration Magic I, II, FACT Project C3 of SFB 876 Prof. Dr. Dr. Wolfgang Rhode Prof. Dr. Katharina Morik Dr. Tim Ruhe Big Data is Volume, Velocity, Variety. Big Data helps if rare events are outliers. Here, we are looking for the needle in the haystack: neutrinos and gamma rays are dominated by other particles. Data analysis tool RapidMiner Feature selection MRMR Streams framework

44 Through-Put Performance of FACT Tools FACT records 60 events per second. Each events amounts to 3 Megabyte of raw data. 180MB/second are to be processed! Average processing time in milliseconds at a log scale shows the overall process ending with a classifier application.

45 Conclusion Data Analysis is needed in order to make good use of big data (in physics). We have seen the streams framework that processes telescope data in real-time. Choosing the right representation is the key to excellent results. We have seen a stable MRMR feature selection. We have seen a learning process producing features for learning unfolding. RapidMiner supports the overall cycle Select a method by a click! Store the process for documentation and exchange.

46

47 Further work The unfolding method could use other binning methods: Reduce entropy within a bin. Tauon neutrinos are hard to recognize, but we ll try! Real-time analysis could support the array of Cherenkov telescopes! S.G.Djorgovski (Caltech, USA) Astronomy has become an immensely data-rich field. There is a need for powerful DM/KDD tools.

48 SFB 876: A projects combine cyberphysical systems with big data analytics A1: new algorithms for ubiquitous systems, which are memory- and energy-aware, Integer probabilistic graphical models App prediction for energy savings A2: theoretical basis for memoryaware streaming clustering A3: methods from embedded systems for performance enhancements of R A4: platform for the analysis of resource consumption A6: Social network analysis Data stream analysis Exploiting sparseness Approximation Parallelism (GPU)

49 SFB 876: B projects focus on cyberphysical systems B1 Breath analysis B2 Virus detection B3 Industrie 4.0 Quality prediction Steel production B4 Traffic prognosis New Sensors Direct control Real-time application of learned models

50 SFB 876: C projects focus on very large data C1 High dimensional microarray data C3 Very high frequent astrophysical data C4 Regression for large scale high dimensional data C5 High capacity particle physics (CERN) Neuroblastoma outcome prediction Feature selection Storage and filtering

Monday Morning Data Mining

Monday Morning Data Mining Monday Morning Data Mining Tim Ruhe Statistische Methoden der Datenanalyse Outline: - data mining - IceCube - Data mining in IceCube Computer Scientists are different... Fakultät Physik Fakultät Physik

More information

Olga Botner, Uppsala. Photo: Sven Lidström. Inspirationsdagar, 2015-03-17

Olga Botner, Uppsala. Photo: Sven Lidström. Inspirationsdagar, 2015-03-17 Olga Botner, Uppsala Photo: Sven Lidström Inspirationsdagar, 2015-03-17 OUTLINE OUTLINE WHY NEUTRINO ASTRONOMY? WHAT ARE NEUTRINOS? A CUBIC KILOMETER DETECTOR EXTRATERRESTRIAL NEUTRINOS WHAT NEXT? CONCLUSIONS

More information

Information Management course

Information Management course Università degli Studi di Milano Master Degree in Computer Science Information Management course Teacher: Alberto Ceselli Lecture 01 : 06/10/2015 Practical informations: Teacher: Alberto Ceselli ([email protected])

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining 1 Why Data Mining? Explosive Growth of Data Data collection and data availability Automated data collection tools, Internet, smartphones, Major sources of abundant data Business:

More information

Big Data Analytics. An Introduction. Oliver Fuchsberger University of Paderborn 2014

Big Data Analytics. An Introduction. Oliver Fuchsberger University of Paderborn 2014 Big Data Analytics An Introduction Oliver Fuchsberger University of Paderborn 2014 Table of Contents I. Introduction & Motivation What is Big Data Analytics? Why is it so important? II. Techniques & Solutions

More information

The Scientific Data Mining Process

The Scientific Data Mining Process Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In

More information

Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health

Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health Lecture 1: Data Mining Overview and Process What is data mining? Example applications Definitions Multi disciplinary Techniques Major challenges The data mining process History of data mining Data mining

More information

From Raw Data to. Actionable Insights with. MATLAB Analytics. Learn more. Develop predictive models. 1Access and explore data

From Raw Data to. Actionable Insights with. MATLAB Analytics. Learn more. Develop predictive models. 1Access and explore data 100 001 010 111 From Raw Data to 10011100 Actionable Insights with 00100111 MATLAB Analytics 01011100 11100001 1 Access and Explore Data For scientists the problem is not a lack of available but a deluge.

More information

Sanjeev Kumar. contribute

Sanjeev Kumar. contribute RESEARCH ISSUES IN DATAA MINING Sanjeev Kumar I.A.S.R.I., Library Avenue, Pusa, New Delhi-110012 [email protected] 1. Introduction The field of data mining and knowledgee discovery is emerging as a

More information

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning How to use Big Data in Industry 4.0 implementations LAURI ILISON, PhD Head of Big Data and Machine Learning Big Data definition? Big Data is about structured vs unstructured data Big Data is about Volume

More information

PoS(ICRC2015)865. FACT-Tools Streamed Real-Time Data Analysis

PoS(ICRC2015)865. FACT-Tools Streamed Real-Time Data Analysis K. A. Brügge b, M. L. Ahnen a, M. Balbo c, M. Bergmann d, A. Biland a, C. Bockermann e, T. Bretz a, J. Buss b, D. Dorner d, S. Einecke b, J. Freiwald b, C. Hempfling d, D. Hildebrand a, G. Hughes a, W.

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015 An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content

More information

Data Mining. Nonlinear Classification

Data Mining. Nonlinear Classification Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15

More information

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic

More information

Data Mining Practical Machine Learning Tools and Techniques

Data Mining Practical Machine Learning Tools and Techniques Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

More information

Data Mining for Knowledge Management. Classification

Data Mining for Knowledge Management. Classification 1 Data Mining for Knowledge Management Classification Themis Palpanas University of Trento http://disi.unitn.eu/~themis Data Mining for Knowledge Management 1 Thanks for slides to: Jiawei Han Eamonn Keogh

More information

Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

More information

COPYRIGHTED MATERIAL. Contents. List of Figures. Acknowledgments

COPYRIGHTED MATERIAL. Contents. List of Figures. Acknowledgments Contents List of Figures Foreword Preface xxv xxiii xv Acknowledgments xxix Chapter 1 Fraud: Detection, Prevention, and Analytics! 1 Introduction 2 Fraud! 2 Fraud Detection and Prevention 10 Big Data for

More information

Maschinelles Lernen mit MATLAB

Maschinelles Lernen mit MATLAB Maschinelles Lernen mit MATLAB Jérémy Huard Applikationsingenieur The MathWorks GmbH 2015 The MathWorks, Inc. 1 Machine Learning is Everywhere Image Recognition Speech Recognition Stock Prediction Medical

More information

Image Compression through DCT and Huffman Coding Technique

Image Compression through DCT and Huffman Coding Technique International Journal of Current Engineering and Technology E-ISSN 2277 4106, P-ISSN 2347 5161 2015 INPRESSCO, All Rights Reserved Available at http://inpressco.com/category/ijcet Research Article Rahul

More information

Why is Internal Audit so Hard?

Why is Internal Audit so Hard? Why is Internal Audit so Hard? 2 2014 Why is Internal Audit so Hard? 3 2014 Why is Internal Audit so Hard? Waste Abuse Fraud 4 2014 Waves of Change 1 st Wave Personal Computers Electronic Spreadsheets

More information

Robot Perception Continued

Robot Perception Continued Robot Perception Continued 1 Visual Perception Visual Odometry Reconstruction Recognition CS 685 11 Range Sensing strategies Active range sensors Ultrasound Laser range sensor Slides adopted from Siegwart

More information

Fast Analytics on Big Data with H20

Fast Analytics on Big Data with H20 Fast Analytics on Big Data with H20 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,

More information

ISSN: 2320-1363 CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS

ISSN: 2320-1363 CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS A.Divya *1, A.M.Saravanan *2, I. Anette Regina *3 MPhil, Research Scholar, Muthurangam Govt. Arts College, Vellore, Tamilnadu, India Assistant

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015 RESEARCH ARTICLE OPEN ACCESS Data Mining Technology for Efficient Network Security Management Ankit Naik [1], S.W. Ahmad [2] Student [1], Assistant Professor [2] Department of Computer Science and Engineering

More information

Advanced analytics at your hands

Advanced analytics at your hands 2.3 Advanced analytics at your hands Neural Designer is the most powerful predictive analytics software. It uses innovative neural networks techniques to provide data scientists with results in a way previously

More information

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data Knowledge Discovery and Data Mining Unit # 2 1 Structured vs. Non-Structured Data Most business databases contain structured data consisting of well-defined fields with numeric or alphanumeric values.

More information

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning SAMSI 10 May 2013 Outline Introduction to NMF Applications Motivations NMF as a middle step

More information

SIPAC. Signals and Data Identification, Processing, Analysis, and Classification

SIPAC. Signals and Data Identification, Processing, Analysis, and Classification SIPAC Signals and Data Identification, Processing, Analysis, and Classification Framework for Mass Data Processing with Modules for Data Storage, Production and Configuration SIPAC key features SIPAC is

More information

High Productivity Data Processing Analytics Methods with Applications

High Productivity Data Processing Analytics Methods with Applications High Productivity Data Processing Analytics Methods with Applications Dr. Ing. Morris Riedel et al. Adjunct Associate Professor School of Engineering and Natural Sciences, University of Iceland Research

More information

Discovery of neutrino oscillations

Discovery of neutrino oscillations INSTITUTE OF PHYSICS PUBLISHING Rep. Prog. Phys. 69 (2006) 1607 1635 REPORTS ON PROGRESS IN PHYSICS doi:10.1088/0034-4885/69/6/r01 Discovery of neutrino oscillations Takaaki Kajita Research Center for

More information

Advanced In-Database Analytics

Advanced In-Database Analytics Advanced In-Database Analytics Tallinn, Sept. 25th, 2012 Mikko-Pekka Bertling, BDM Greenplum EMEA 1 That sounds complicated? 2 Who can tell me how best to solve this 3 What are the main mathematical functions??

More information

Data Mining Algorithms Part 1. Dejan Sarka

Data Mining Algorithms Part 1. Dejan Sarka Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka ([email protected]) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses

More information

The accurate calibration of all detectors is crucial for the subsequent data

The accurate calibration of all detectors is crucial for the subsequent data Chapter 4 Calibration The accurate calibration of all detectors is crucial for the subsequent data analysis. The stability of the gain and offset for energy and time calibration of all detectors involved

More information

8.1 Radio Emission from Solar System objects

8.1 Radio Emission from Solar System objects 8.1 Radio Emission from Solar System objects 8.1.1 Moon and Terrestrial planets At visible wavelengths all the emission seen from these objects is due to light reflected from the sun. However at radio

More information

Big Data Mining Services and Knowledge Discovery Applications on Clouds

Big Data Mining Services and Knowledge Discovery Applications on Clouds Big Data Mining Services and Knowledge Discovery Applications on Clouds Domenico Talia DIMES, Università della Calabria & DtoK Lab Italy [email protected] Data Availability or Data Deluge? Some decades

More information

COURSE RECOMMENDER SYSTEM IN E-LEARNING

COURSE RECOMMENDER SYSTEM IN E-LEARNING International Journal of Computer Science and Communication Vol. 3, No. 1, January-June 2012, pp. 159-164 COURSE RECOMMENDER SYSTEM IN E-LEARNING Sunita B Aher 1, Lobo L.M.R.J. 2 1 M.E. (CSE)-II, Walchand

More information

Statistics for BIG data

Statistics for BIG data Statistics for BIG data Statistics for Big Data: Are Statisticians Ready? Dennis Lin Department of Statistics The Pennsylvania State University John Jordan and Dennis K.J. Lin (ICSA-Bulletine 2014) Before

More information

Big Data. Introducción. Santiago González <[email protected]>

Big Data. Introducción. Santiago González <sgonzalez@fi.upm.es> Big Data Introducción Santiago González Contenidos Por que BIG DATA? Características de Big Data Tecnologías y Herramientas Big Data Paradigmas fundamentales Big Data Data Mining

More information

Comparison of Data Mining Techniques used for Financial Data Analysis

Comparison of Data Mining Techniques used for Financial Data Analysis Comparison of Data Mining Techniques used for Financial Data Analysis Abhijit A. Sawant 1, P. M. Chawan 2 1 Student, 2 Associate Professor, Department of Computer Technology, VJTI, Mumbai, INDIA Abstract

More information

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be

More information

The OPERA Emulsions. Jan Lenkeit. Hamburg Student Seminar, 12 June 2008. Institut für Experimentalphysik Forschungsgruppe Neutrinophysik

The OPERA Emulsions. Jan Lenkeit. Hamburg Student Seminar, 12 June 2008. Institut für Experimentalphysik Forschungsgruppe Neutrinophysik The OPERA Emulsions Jan Lenkeit Institut für Experimentalphysik Forschungsgruppe Neutrinophysik Hamburg Student Seminar, 12 June 2008 1/43 Outline The OPERA experiment Nuclear emulsions The OPERA emulsions

More information

Distributed forests for MapReduce-based machine learning

Distributed forests for MapReduce-based machine learning Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication

More information

Data Centric Systems (DCS)

Data Centric Systems (DCS) Data Centric Systems (DCS) Architecture and Solutions for High Performance Computing, Big Data and High Performance Analytics High Performance Computing with Data Centric Systems 1 Data Centric Systems

More information

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction COMP3420: Advanced Databases and Data Mining Classification and prediction: Introduction and Decision Tree Induction Lecture outline Classification versus prediction Classification A two step process Supervised

More information

Learning from Big Data in

Learning from Big Data in Learning from Big Data in Astronomy an overview Kirk Borne George Mason University School of Physics, Astronomy, & Computational Sciences http://spacs.gmu.edu/ From traditional astronomy 2 to Big Data

More information

Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods

Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods Jerzy B laszczyński 1, Krzysztof Dembczyński 1, Wojciech Kot lowski 1, and Mariusz Paw lowski 2 1 Institute of Computing

More information

Master of Science in Computer Science

Master of Science in Computer Science Master of Science in Computer Science Background/Rationale The MSCS program aims to provide both breadth and depth of knowledge in the concepts and techniques related to the theory, design, implementation,

More information

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05 Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

More information

SEIZE THE DATA. 2015 SEIZE THE DATA. 2015

SEIZE THE DATA. 2015 SEIZE THE DATA. 2015 1 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. BIG DATA CONFERENCE 2015 Boston August 10-13 Predicting and reducing deforestation

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

More information

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski [email protected]

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk Introduction to Machine Learning and Data Mining Prof. Dr. Igor Trajkovski [email protected] Ensembles 2 Learning Ensembles Learn multiple alternative definitions of a concept using different training

More information

Getting Even More Out of Ensemble Selection

Getting Even More Out of Ensemble Selection Getting Even More Out of Ensemble Selection Quan Sun Department of Computer Science The University of Waikato Hamilton, New Zealand [email protected] ABSTRACT Ensemble Selection uses forward stepwise

More information

Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research [email protected]

Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Introduction to Machine Learning Lecture 1 Mehryar Mohri Courant Institute and Google Research [email protected] Introduction Logistics Prerequisites: basics concepts needed in probability and statistics

More information

Big Data Analytics. Chances and Challenges. Volker Markl

Big Data Analytics. Chances and Challenges. Volker Markl Volker Markl Professor and Chair Database Systems and Information Management (DIMA), Technische Universität Berlin www.dima.tu-berlin.de Big Data Analytics Chances and Challenges Volker Markl DIMA BDOD

More information

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam

More information

Nine Common Types of Data Mining Techniques Used in Predictive Analytics

Nine Common Types of Data Mining Techniques Used in Predictive Analytics 1 Nine Common Types of Data Mining Techniques Used in Predictive Analytics By Laura Patterson, President, VisionEdge Marketing Predictive analytics enable you to develop mathematical models to help better

More information

Scalable Developments for Big Data Analytics in Remote Sensing

Scalable Developments for Big Data Analytics in Remote Sensing Scalable Developments for Big Data Analytics in Remote Sensing Federated Systems and Data Division Research Group High Productivity Data Processing Dr.-Ing. Morris Riedel et al. Research Group Leader,

More information

Visualisatie BMT. Introduction, visualization, visualization pipeline. Arjan Kok Huub van de Wetering ([email protected])

Visualisatie BMT. Introduction, visualization, visualization pipeline. Arjan Kok Huub van de Wetering (h.v.d.wetering@tue.nl) Visualisatie BMT Introduction, visualization, visualization pipeline Arjan Kok Huub van de Wetering ([email protected]) 1 Lecture overview Goal Summary Study material What is visualization Examples

More information

Calorimetry in particle physics experiments

Calorimetry in particle physics experiments Calorimetry in particle physics experiments Unit n. 8 Calibration techniques Roberta Arcidiacono Lecture overview Introduction Hardware Calibration Test Beam Calibration In-situ Calibration (EM calorimeters)

More information

Big Data Text Mining and Visualization. Anton Heijs

Big Data Text Mining and Visualization. Anton Heijs Copyright 2007 by Treparel Information Solutions BV. This report nor any part of it may be copied, circulated, quoted without prior written approval from Treparel7 Treparel Information Solutions BV Delftechpark

More information

Neutron Stars. How were neutron stars discovered? The first neutron star was discovered by 24-year-old graduate student Jocelyn Bell in 1967.

Neutron Stars. How were neutron stars discovered? The first neutron star was discovered by 24-year-old graduate student Jocelyn Bell in 1967. Neutron Stars How were neutron stars discovered? The first neutron star was discovered by 24-year-old graduate student Jocelyn Bell in 1967. Using a radio telescope she noticed regular pulses of radio

More information

MS1b Statistical Data Mining

MS1b Statistical Data Mining MS1b Statistical Data Mining Yee Whye Teh Department of Statistics Oxford http://www.stats.ox.ac.uk/~teh/datamining.html Outline Administrivia and Introduction Course Structure Syllabus Introduction to

More information

HUAWEI Advanced Data Science with Spark Streaming. Albert Bifet (@abifet)

HUAWEI Advanced Data Science with Spark Streaming. Albert Bifet (@abifet) HUAWEI Advanced Data Science with Spark Streaming Albert Bifet (@abifet) Huawei Noah s Ark Lab Focus Intelligent Mobile Devices Data Mining & Artificial Intelligence Intelligent Telecommunication Networks

More information

Statistical Challenges with Big Data in Management Science

Statistical Challenges with Big Data in Management Science Statistical Challenges with Big Data in Management Science Arnab Kumar Laha Indian Institute of Management Ahmedabad Analytics vs Reporting Competitive Advantage Reporting Prescriptive Analytics (Decision

More information

Supervised Feature Selection & Unsupervised Dimensionality Reduction

Supervised Feature Selection & Unsupervised Dimensionality Reduction Supervised Feature Selection & Unsupervised Dimensionality Reduction Feature Subset Selection Supervised: class labels are given Select a subset of the problem features Why? Redundant features much or

More information

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

More information

Dr. Raju Namburu Computational Sciences Campaign U.S. Army Research Laboratory. The Nation s Premier Laboratory for Land Forces UNCLASSIFIED

Dr. Raju Namburu Computational Sciences Campaign U.S. Army Research Laboratory. The Nation s Premier Laboratory for Land Forces UNCLASSIFIED Dr. Raju Namburu Computational Sciences Campaign U.S. Army Research Laboratory 21 st Century Research Continuum Theory Theory embodied in computation Hypotheses tested through experiment SCIENTIFIC METHODS

More information

Ionospheric Research with the LOFAR Telescope

Ionospheric Research with the LOFAR Telescope Ionospheric Research with the LOFAR Telescope Leszek P. Błaszkiewicz Faculty of Mathematics and Computer Science, UWM Olsztyn LOFAR - The LOw Frequency ARray The LOFAR interferometer consist of a large

More information

Data Mining: Overview. What is Data Mining?

Data Mining: Overview. What is Data Mining? Data Mining: Overview What is Data Mining? Recently * coined term for confluence of ideas from statistics and computer science (machine learning and database methods) applied to large databases in science,

More information

A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier

A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier G.T. Prasanna Kumari Associate Professor, Dept of Computer Science and Engineering, Gokula Krishna College of Engg, Sullurpet-524121,

More information

Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

More information

Pentaho Data Mining Last Modified on January 22, 2007

Pentaho Data Mining Last Modified on January 22, 2007 Pentaho Data Mining Copyright 2007 Pentaho Corporation. Redistribution permitted. All trademarks are the property of their respective owners. For the latest information, please visit our web site at www.pentaho.org

More information

Active Learning SVM for Blogs recommendation

Active Learning SVM for Blogs recommendation Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the

More information

White Paper. How Streaming Data Analytics Enables Real-Time Decisions

White Paper. How Streaming Data Analytics Enables Real-Time Decisions White Paper How Streaming Data Analytics Enables Real-Time Decisions Contents Introduction... 1 What Is Streaming Analytics?... 1 How Does SAS Event Stream Processing Work?... 2 Overview...2 Event Stream

More information

PROGRAM DIRECTOR: Arthur O Connor Email Contact: URL : THE PROGRAM Careers in Data Analytics Admissions Criteria CURRICULUM Program Requirements

PROGRAM DIRECTOR: Arthur O Connor Email Contact: URL : THE PROGRAM Careers in Data Analytics Admissions Criteria CURRICULUM Program Requirements Data Analytics (MS) PROGRAM DIRECTOR: Arthur O Connor CUNY School of Professional Studies 101 West 31 st Street, 7 th Floor New York, NY 10001 Email Contact: Arthur O Connor, [email protected] URL:

More information

The VHE future. A. Giuliani

The VHE future. A. Giuliani The VHE future A. Giuliani 1 The Cherenkov Telescope Array Low energy Medium energy High energy 4.5o FoV 7o FoV 10o FoV 2000 pixels 2000 pixels 2000 pixels ~ 0.1 ~ 0.18 ~ 0.2-0.3 The Cherenkov Telescope

More information

REVIEW OF ENSEMBLE CLASSIFICATION

REVIEW OF ENSEMBLE CLASSIFICATION Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IJCSMC, Vol. 2, Issue.

More information

Data, Measurements, Features

Data, Measurements, Features Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are

More information

An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

More information

Doctor of Philosophy in Computer Science

Doctor of Philosophy in Computer Science Doctor of Philosophy in Computer Science Background/Rationale The program aims to develop computer scientists who are armed with methods, tools and techniques from both theoretical and systems aspects

More information

IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH

IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH Kalinka Mihaylova Kaloyanova St. Kliment Ohridski University of Sofia, Faculty of Mathematics and Informatics Sofia 1164, Bulgaria

More information

Big Data: Image & Video Analytics

Big Data: Image & Video Analytics Big Data: Image & Video Analytics How it could support Archiving & Indexing & Searching Dieter Haas, IBM Deutschland GmbH The Big Data Wave 60% of internet traffic is multimedia content (images and videos)

More information

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012 Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts

More information

BIG DATA What it is and how to use?

BIG DATA What it is and how to use? BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14

More information

Information about the T9 beam line and experimental facilities

Information about the T9 beam line and experimental facilities Information about the T9 beam line and experimental facilities The incoming proton beam from the PS accelerator impinges on the North target and thus produces the particles for the T9 beam line. The collisions

More information

Machine Learning and Data Mining. Fundamentals, robotics, recognition

Machine Learning and Data Mining. Fundamentals, robotics, recognition Machine Learning and Data Mining Fundamentals, robotics, recognition Machine Learning, Data Mining, Knowledge Discovery in Data Bases Their mutual relations Data Mining, Knowledge Discovery in Databases,

More information

not possible or was possible at a high cost for collecting the data.

not possible or was possible at a high cost for collecting the data. Data Mining and Knowledge Discovery Generating knowledge from data Knowledge Discovery Data Mining White Paper Organizations collect a vast amount of data in the process of carrying out their day-to-day

More information