Big Data Analytics in Astrophysics. Katharina Morik, Informatik 8, TU Dortmund





Overview
Short introduction to the Collaborative Research Center SFB 876.
Zooming into one of its projects, C3 Astrophysical Data Analysis: IceCube; MAGIC and FACT; tools for data analysis and streaming data.
Science today is based on data. Data analysis is intrinsically tied to the scientific process.
Active Galactic Nuclei

Collaborative Research Center SFB 876: Providing Information by Resource-Constrained Data Analysis Already granted: 2011-2018

SFB 876: Cyberphysical Systems
Ubiquitous systems; smartphones; multimedia; navigation; support of elderly people; sensor measurements; factories; medicine; science.
Goals: analysis and prediction available anytime, everywhere!

SFB 876: Big Data Analytics
Statistical analysis: carefully collected data, sophisticated analysis.
Data mining: large data sets gathered for purposes other than analysis; simple, reliable models.
Big data analytics: data of large volume, velocity, and variety; new storage and processing architectures.
Goals: scalable algorithms, algorithms for new architectures.

SFB 876: Resource-aware algorithms
Cyberphysical systems produce big data. Big data analytics delivers data summaries and predictions.
Foundations: beyond runtime and sample complexity! Future: memory- and energy-efficient analytics.
The Collaborative Research Center SFB 876 comprises 13 projects, 20 professors, and about 50 PhD students.

Project leaders
Christian Sohler: Theoretical CS
Sangkyun Lee: Data Analysis, CS
Olaf Spinczyk: Embedded Systems, CS
Alexander Schramm: Biomedicine
Christian Wietfeld: Electrical Engineering
Michael Schreckenberg: Traffic Physics
Jochen Deuse: Machining
Jian-Jia Chen: Embedded Systems, CS
Kristian Kersting: Data Analysis, CS
Katharina Morik: Data Analysis, CS
Heinrich Müller: Visual Analytics, CS
Petra Mutzel: Algorithm Engineering, CS
Jörg Rahnenführer: Statistics
Peter Marwedel: Embedded Systems, CS
Katja Ickstadt: Statistics
Tim Ruhe: Astrophysics
Wolfgang Rhode: Astrophysics
Bernhard Spaan: Particle Physics
Jens Teubner: Databases, CS
Michael ten Hompel: Logistics


Big Data Analytics in Astrophysics
High Energy Astrophysical Phenomena: neutrinos, gamma rays.
Instrumentation and Methods for Astrophysics: telescopes, data analysis.
IceCube Collaboration; MAGIC I, II; FACT.
Project C3 of SFB 876: Prof. Dr. Dr. Wolfgang Rhode, Prof. Dr. Katharina Morik, Dr. Tim Ruhe.
Big Data is Volume, Velocity, Variety. Big Data helps if rare events are outliers. Here, we are looking for the needle in the haystack: neutrinos and gamma rays are vastly outnumbered by other particles.
Data analysis tool: RapidMiner. Feature selection: MRMR. Streams framework.

Structure of the remaining talk: the data analysis cycle
Data analysis is not just one step in the overall scientific investigation; it accompanies the investigation throughout.
Data analytics is interdisciplinary in nature: physics, statistics, data management, machine learning, software/algorithm engineering, complexity theory.
Diagram labels: Physics questions, Insight.

Data Analysis Tools - Requirements
Coverage: support the overall cycle. Not just the modeling step.
Amenability: easy to use for all involved scientists. Not just for statisticians, computer scientists, or physicists, but for all of them.
Rapidity: fast creation of the analysis process. Not a bunch of programs/libraries to be used in programming.
Comparability: international scientific terminology. Not a new name with a little variance for known methods that have a theoretical basis.
Reproducibility: storing the analysis process with all parameters in a small format. Not asking the programmer whether he remembers which parameters succeeded.

RapidMiner
Coverage: support the overall cycle.
Amenability: easy to use for all involved scientists.
Rapidity: fast creation of the analysis process.
Comparability: international scientific terminology.
Reproducibility: storing the analysis process with all parameters in a small format, re-run, exchange.
If 115 data transformations and 264 modeling operators are not enough: write your own operator and plug it into RapidMiner!

Physics Questions
Explaining the birth and death of stars. What happens when particles collide, and how, exactly?
High Energy Astrophysical Phenomena: neutrinos, gamma rays.
Instrumentation and Methods for Astrophysics: telescopes, data analysis.

Death of a star
1054: Chinese astronomers detect a supernova. Its remnant is the Crab Nebula.
1967: Jocelyn Bell and Antony Hewish discover a pulsar, whose signal comes from a rotating neutron star.
The collapse of the dying star has forced electrons and protons to combine into neutrons, which releases a flux of neutrinos. Neutron stars are composed of neutrons and may produce gamma rays.

Neutrino genesis
$\beta^-$ decay: $n \to p + e^- + \bar{\nu}_e$. A neutron becomes a proton, an electron, and an electron antineutrino.
$\beta^+$ decay: $p \to n + e^+ + \nu_e$. A proton becomes a neutron, a positron, and an electron neutrino.
Neutrinos accompany the electron, the muon (historically called mu-meson), and the tauon.
A muon neutrino collides with the proton of a hydrogen atom, resulting in a positive pion and a negative muon.


Faster than light
Cherenkov radiation was noticed by Marie Curie, studied experimentally by Pavel Cherenkov, and explained theoretically by Ilya Frank and Igor Tamm. If a charged particle moves through a medium faster than light travels in that medium, blue Cherenkov light shows up. The medium can be water, ice, or air.
A "faster than light" neutrino measurement turned out to be the result of a loose cable, so the scientific community can rest easy. CERN had sent a beam of neutrinos to the OPERA experiment at the Gran Sasso laboratory.

Breakthrough of the year 2013: IceCube
The IceCube project was awarded the 2013 Breakthrough of the Year by the British magazine Physics World.
Detection of 28 neutrinos from outside the Milky Way: 2 neutrinos with 1 PeV (Ernie and Bert), one even with 2 PeV (Big Bird).
Tasks: Model the distribution of all neutrinos (frequency of neutrinos at all energies)! Separate neutrinos from other particles.
IceCube collaboration: 36 research institutes in 8 countries; spokespersons: Francis Halzen (University of Wisconsin-Madison) and Olga Botner (Uppsala University).
[Figure labels: cosmic neutrinos, atmospheric neutrinos, proton, interaction in orbit, interaction in detector; IceCube, MAGIC.]

MAGIC and FACT Cherenkov Telescopes
An air shower is visible in the camera for about 150 nanoseconds. The 1440 pixels of the camera are sampled at a rate of 2 GHz. A hardware filter decides about storage. Records consist of the sampled voltages of the camera pixels.
Tasks: Model the distribution of gamma rays. Separate gamma rays from other rays.

Data understanding
What do the data look like? How can the data be accessed? Do we need real-time access?

Data volume challenging data analysis
IceCube at the South Pole: Digital Optical Modules (DOMs) mounted within the holes record Cherenkov light. The trace of muons through the 3D array of DOMs also shows the direction of the incoming neutrinos.
1 terabyte per day = 1 million megabytes.
A satellite takes about 10 years to transport the data of one year to the University of Wisconsin. A ship transports the data on hard disks in 28 days. Hence no real-time analysis: ships are faster than satellites for data transmission.
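The transfer figures on this slide can be turned into effective data rates with a few lines of arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes) and a 365-day year, neither of which is stated on the slide; the function name is made up for illustration.

```python
# Back-of-the-envelope check of the transfer times quoted on the slide.
# Assumptions (not from the slide): 1 TB = 10**12 bytes, 1 year = 365 days.

SECONDS_PER_DAY = 24 * 60 * 60
BYTES_PER_TB = 10**12


def effective_rate_mbit_per_s(terabytes: float, days: float) -> float:
    """Average rate in Mbit/s needed to move `terabytes` within `days`."""
    bits = terabytes * BYTES_PER_TB * 8
    return bits / (days * SECONDS_PER_DAY) / 1e6


yearly_volume_tb = 1 * 365  # 1 TB per day for one year

satellite_rate = effective_rate_mbit_per_s(yearly_volume_tb, days=10 * 365)
ship_rate = effective_rate_mbit_per_s(yearly_volume_tb, days=28)

print(f"satellite: {satellite_rate:.1f} Mbit/s")
print(f"ship:      {ship_rate:.1f} Mbit/s")
print(f"speed-up:  {ship_rate / satellite_rate:.0f}x")
```

Under these assumptions the ship carries the same yearly volume roughly 130 times faster than the satellite link, which is the point the slide makes.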

Skewed distribution challenging data analysis
Processing steps: calibration and cleaning, feature extraction, signal separation, energy estimation.
A simulator provides labeled observations. High-energy gamma rays are rare events compared to hadrons; the ratio is 1 to 1000.
MAGIC I (2003) and MAGIC II (2009): La Palma, Roque de los Muchachos. FACT telescope: same type, same place.

FACT-Viewer

Streams framework for real-time processing
The streams framework offers a high-level view of streaming data (Bockermann, Blom 2012). There is a plugin for RapidMiner. It incorporates MOA for the analysis of streaming data. If desired, it compiles into Storm for execution.
The FACT-Tools are a set of Java classes that extend the streams framework for reading and processing data from the FACT telescope.
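The general idea of processing a stream item by item through a chain of small processors can be illustrated in a few lines of Python. This is only a sketch of the pattern and not the API of the streams framework or of the FACT-Tools (which are written in Java); all function names, field names, and numbers below are hypothetical.

```python
# Generic item-by-item stream processing; NOT the streams framework API.
# All names, fields, and thresholds are invented for illustration.

def run_stream(source, processors):
    """Pass each item through all processors; a processor returns None to drop it."""
    for item in source:
        for process in processors:
            item = process(item)
            if item is None:
                break  # item was discarded, skip the remaining processors
        else:
            yield item


def calibrate(item):
    # Subtract a per-event baseline from the sampled pixel values.
    return {**item, "samples": [s - item["baseline"] for s in item["samples"]]}


def drop_quiet_events(item):
    # Keep only events in which at least one pixel exceeds a threshold.
    return item if max(item["samples"]) >= 50 else None


def add_total_signal(item):
    # Extract one simple feature: the summed signal of the event.
    return {**item, "total": sum(item["samples"])}


events = [
    {"id": 1, "baseline": 10, "samples": [12, 11, 13]},
    {"id": 2, "baseline": 10, "samples": [110, 210, 10]},
]

results = list(run_stream(events, [calibrate, drop_quiet_events, add_total_signal]))
print(results)
```

Because `run_stream` is a generator, items are handled one at a time and never need to be held in memory together, which is the property that matters for telescope data arriving continuously.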

Data understanding: what is to be learned?
IceCube: The overall spectrum of atmospheric neutrinos from 100 GeV to 1 PeV demands better background rejection.
MAGIC, FACT: Better separation of hadrons and gamma rays. Reproducible results.
Tim Ruhe, Katharina Morik, Wolfgang Rhode: Application of RapidMiner in Neutrino Astronomy. In: Hofmann, Klinkenberg (eds.), RapidMiner: Data Mining Use Cases and Business Analytics Applications, CRC Press 2014.
Mark Aartsen, Katharina Morik, Wolfgang Rhode, Tim Ruhe: Development of a General Analysis and Unfolding Scheme and its Application to Measure the Energy Spectrum of Atmospheric Neutrinos with IceCube. In: Eur. Phys. Journal 2014.

Data preparation: training samples and features
Samples: a balanced stratified sample enhances learning.
Training: 27 000 signal 27 000 neutrino events. Labeling by simulation.
IceCube, 59 strings: measurements of 346 days; 195 321 860 raw events; 16 983 100 after manual cuts. No true labels are given, but an estimate according to the distribution of neutrinos.
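Drawing a balanced sample, with the same number of examples per class, can be sketched as follows. The helper name, the toy events, and the class sizes are invented for illustration; they are not the IceCube data or the RapidMiner operator used in the talk.

```python
import random
from collections import Counter, defaultdict


def balanced_sample(examples, label_of, per_class, seed=0):
    """Draw `per_class` examples from every class, without replacement."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for example in examples:
        by_label[label_of(example)].append(example)
    sample = []
    for label, group in by_label.items():
        if len(group) < per_class:
            raise ValueError(f"class {label!r} has only {len(group)} examples")
        sample.extend(rng.sample(group, per_class))
    rng.shuffle(sample)  # avoid a class-sorted order in the training set
    return sample


# Toy data (invented): 1000 background events and 20 signal events.
events = [("background", i) for i in range(1000)] + [("signal", i) for i in range(20)]
training = balanced_sample(events, label_of=lambda e: e[0], per_class=20)
print(Counter(label for label, _ in training))
```

With a skewed class ratio, an unbalanced training set lets a learner reach low error by always predicting the majority class; sampling equal counts per class removes that shortcut.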

Data preparation: feature selection
Too many features hide the true pattern and slow down modeling. Redundant features may lead to wrong results.
Wrong: two features with the same meaning have twice the impact. Right: each of the two features should have half the impact.
The quality of a feature set is given by the performance of learning with that feature set.

Minimum Redundancy Maximum Relevance (MRMR)
Start with an empty feature set. In each step, add the feature $x$ with the lowest redundancy $D(x, x')$ to the already chosen features $x' \in F_j$ and the highest relevance $R(x, y)$ with respect to the label $y$:
$$Q = R(x, y) - \frac{1}{j} \sum_{x' \in F_j} D(x, x')$$
Efficient implementation by Benjamin Schowe in RapidMiner's Feature Selection Extension.
Mark Hall: Correlation-Based Feature Selection, 1999.

Feature selection in RapidMiner 2

Overfitting to a sample? Measuring stability!
Split the training data into m sets. Do feature selection on m-1 sets, use the selected features for learning, and test the learner's performance. Do that m times.
One sample leads to the feature set A, another to the feature set B; both select k features out of the n given ones. The Jaccard index measures their agreement:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
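The Jaccard index of two feature sets, the size of their intersection divided by the size of their union, is easy to compute. The sketch below also averages it over all pairs of feature sets obtained from the m runs; that averaging is a common convention and an assumption here, since the slide does not say how the m results are combined. The feature names are invented.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two feature sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


def stability(feature_sets):
    """Mean pairwise Jaccard index over the feature sets of all runs."""
    pairs = list(combinations(feature_sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Toy example (invented feature names): three runs selecting k = 3 features each.
runs = [
    {"zenith", "charge", "length"},
    {"zenith", "charge", "speed"},
    {"zenith", "charge", "length"},
]
score = stability(runs)
print(round(score, 3))
```

A score of 1 means every run selected exactly the same features; values near 0 indicate that the selection depends heavily on the particular sample.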

Feature transformation and extraction
The right representation eases understanding and modeling, e.g., for time series: deterministic, with outlier, with level change.
[Figure: time series $y_t$ (labeled $HR_t$ in one panel) over time $t$, and the corresponding plots of $y_{t+1}$ against $y_t$. Source: U. Gather, M. Bauer.]

Feature selection and extraction in IceCube
200 features are given, diverse constructions over the raw data.
Feature selection using MRMR is particularly successful because the data contain redundant features; e.g., the zenith angle or a quantity derived from it is obtained from several reconstruction algorithms. 25 features are selected with a stability above 0.8.
Creating features by binning (spectrum unfolding) and learning: Tim Ruhe's method.
Novel spectrum unfolding: Partition the attribute values into bins (intervals). The intervals become labeled classes. Run multiclass learning and output the confidence of each class for each example, giving a table with one row per example (ex 1 to ex n), one confidence column per bin (bin 1 to bin r), and the column "Neutrino?". Learn a classifier from these data.

Binning in RapidMiner Numerical data Discretized into 4 bins Discretized to equal frequency bins
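Equal-frequency discretization assigns bin indices so that every bin receives (almost) the same number of values. The following is a minimal sketch in plain Python, not RapidMiner's operator; the function name and the sample values are made up, and equal values may end up in neighbouring bins.

```python
def equal_frequency_bins(values, n_bins):
    """Return one bin index (0 .. n_bins-1) per value, with equally filled bins."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, index in enumerate(order):
        # The rank of a value among all values decides its bin.
        bins[index] = rank * n_bins // len(values)
    return bins


# Toy data (invented): eight numerical values discretized into 4 bins.
energies = [3.2, 0.5, 7.1, 1.0, 9.9, 4.4, 2.0, 6.3]
bin_ids = equal_frequency_bins(energies, n_bins=4)
print(bin_ids)
```

The resulting bin indices can then serve as class labels for a multiclass learner, which is how binning turns a numerical target into a classification task.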

Modeling tasks: classification and regression
Given observations $x$ with labels $y$, $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, with binominal $y$ (classification) or real-valued $y$ (regression), find $f$ with $f(x) = y$ such that the error is minimized.
RapidMiner offers 167 learning algorithms for classification and regression.
IceCube: $y \in \{\text{neutrino}, \text{not neutrino}\}$. The algorithm should be robust, scalable, parallel!

Types of Models
Lazy modeling, local models: k-Nearest Neighbors.
Additive models: decision trees.
Linear models: linear regression, support vector machine.
Bayesian models.

Ensemble Methods (here: for decision trees)
Ensemble: take many models and decide according to the majority vote!
Training algorithm, for l decision trees (in parallel):
(1) Take a sample from the data.
(2) Take a subset of the features.
(3) Choose the best feature according to minimal entropy and split according to this feature. If the purity is not ok, go to (3).
Weka Random Forest in RapidMiner, parallel. Breiman 2001, Machine Learning Journal.
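The ingredients of the training loop (bootstrap sample, random feature subset, entropy-based split, majority vote) can be shown with a deliberately reduced stand-in: an ensemble of one-level trees (stumps) instead of fully grown trees. This is not the Weka Random Forest used in the talk; all names and the toy data are assumptions made for illustration.

```python
import random
from collections import Counter
from math import log2


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def train_stump(rows, labels, feature_ids):
    """Find the (feature, threshold) split with minimal weighted entropy."""
    best = None
    for f in feature_ids:
        for t in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t, majority(left), majority(right))
    if best is None:  # no valid split: always predict the majority label
        m = majority(labels)
        return (feature_ids[0], float("inf"), m, m)
    return best[1:]


def train_ensemble(rows, labels, n_trees=25, n_features=1, seed=0):
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # (1) bootstrap sample
        feats = rng.sample(range(d), n_features)     # (2) random feature subset
        stumps.append(train_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx], feats))  # (3) best split
    return stumps


def predict(stumps, row):
    votes = [left if row[f] <= t else right for f, t, left, right in stumps]
    return majority(votes)  # the ensemble decides by majority vote


# Toy data (invented): two features that both separate the classes.
rows = [(0.10, 1.0), (0.15, 1.2), (0.20, 0.9), (0.25, 1.1), (0.30, 1.3), (0.12, 1.0),
        (0.80, 3.0), (0.85, 3.3), (0.90, 2.9), (0.95, 3.1), (0.70, 3.4), (0.75, 3.2)]
labels = ["background"] * 6 + ["neutrino"] * 6
forest = train_ensemble(rows, labels)
print(predict(forest, (0.05, 0.5)), predict(forest, (0.99, 4.0)))
```

A real random forest would keep splitting each node until it is pure enough, as step (3) of the slide describes; the trees are independent of each other, which is why training parallelizes well.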

Evaluation
Training and testing on different samples is necessary in order to estimate the true error. Testing on the training set would overestimate the quality of the model.
Best test: leave-one-out requires n runs of modeling for n observations. This is not efficient.
Cross-validation: split the data into m partitions, use m-1 for training, and test on the unused one. Repeat for each partition and output the average of all tests. Fair.
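Cross-validation with m partitions can be sketched in plain Python as follows. The helper names, the interleaved fold assignment without shuffling, and the trivial majority-class learner are assumptions for the sake of a small runnable example; setting m to the number of observations gives leave-one-out.

```python
from collections import Counter


def k_fold_indices(n, m):
    """Yield (train_indices, test_indices) for m folds over n observations."""
    folds = [list(range(i, n, m)) for i in range(m)]  # interleaved, no shuffling
    for i in range(m):
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train, folds[i]


def cross_validate(xs, ys, m, fit, error):
    """Average the test error over m train/test splits."""
    errors = []
    for train, test in k_fold_indices(len(xs), m):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors.append(error(model, [xs[i] for i in test], [ys[i] for i in test]))
    return sum(errors) / len(errors)


# Hypothetical baseline learner: always predict the most frequent training label.
def fit_majority(xs, ys):
    return Counter(ys).most_common(1)[0][0]


def misclassification_rate(model, xs, ys):
    return sum(1 for y in ys if y != model) / len(ys)


# Toy data (invented): 8 background and 2 neutrino examples.
xs = list(range(10))
ys = ["background"] * 8 + ["neutrino"] * 2
cv_error = cross_validate(xs, ys, m=5, fit=fit_majority, error=misclassification_rate)
print(round(cv_error, 3))
```

Every observation is used for testing exactly once and never in the same run in which it was used for training, which is what makes the averaged error a fair estimate.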

Cross validation in RapidMiner
"Your measurement is meaningless without knowledge of the error." (Walter Lewin)

IceCube Results
We enhanced the neutrino recognition by 62% (IceCube collaboration and Morik 2014).
Quality cuts lead from the full data D to D', rejecting the easy 91.4% of the background. Random Forest leads from D' to D'', so that 99.9999% of the background muons are rejected.
At this background rejection, 27 771 atmospheric neutrino events were detected in 346 days of IceCube 59.

IceCube Results
For the first time, the spectrum of atmospheric neutrinos could be reconstructed up to an energy of 1 PeV. Now, theoretical astrophysics can build upon this.

Empirical work for theory development
[Figure: flux $E_\nu^2 \Phi_\nu$ in GeV cm$^{-2}$ s$^{-1}$ sr$^{-1}$ over $\log_{10}(E_\nu\,[\mathrm{GeV}])$. Series: IC-59 $\nu_\mu$ unfolding, AMANDA $\nu_\mu$ unfolding, IC-79 $\nu_e$ flux, Gaisser H3a, Gaisser H3a no charm, Frejus $\nu_\mu$, Frejus $\nu_e$, Frejus $\nu_\mu$ model, Frejus $\nu_e$ model, Honda $\nu_e$.]

Interdisciplinary work


Throughput Performance of FACT-Tools
FACT records 60 events per second. Each event amounts to 3 megabytes of raw data, so 180 MB per second have to be processed!
[Figure: average processing time in milliseconds, on a log scale, for the overall process ending with the application of a classifier.]
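The quoted rates imply a time budget per event. The small calculation below derives it, assuming (this is not stated on the slide) that events are processed one after another in a single pipeline and that 1 GB = 1000 MB.

```python
# Figures from the slide.
EVENTS_PER_SECOND = 60
MB_PER_EVENT = 3

# Derived quantities (assumptions: sequential processing, 1 GB = 1000 MB).
rate_mb_per_s = EVENTS_PER_SECOND * MB_PER_EVENT   # raw data rate
budget_ms_per_event = 1000 / EVENTS_PER_SECOND     # time available per event
volume_gb_per_hour = rate_mb_per_s * 3600 / 1000   # data per hour of recording

print(f"data rate: {rate_mb_per_s} MB/s")
print(f"budget:    {budget_ms_per_event:.1f} ms per event")
print(f"volume:    {volume_gb_per_hour:.0f} GB per hour")
```

If the whole chain from calibration to classifier application takes longer per event than this budget, a single sequential pipeline falls behind the telescope and the work has to be parallelized.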

Conclusion
Data analysis is needed in order to make good use of big data (in physics).
We have seen the streams framework, which processes telescope data in real time.
Choosing the right representation is the key to excellent results. We have seen a stable MRMR feature selection. We have seen a learning process that produces features for learning the unfolding.
RapidMiner supports the overall cycle: select a method by a click! Store the process for documentation and exchange.

Further work
The unfolding method could use other binning methods, e.g., reducing the entropy within a bin.
Tauon neutrinos are hard to recognize, but we'll try!
Real-time analysis could support the array of Cherenkov telescopes!
S. G. Djorgovski (Caltech, USA): Astronomy has become an immensely data-rich field. There is a need for powerful DM/KDD tools.

SFB 876: A projects combine cyberphysical systems with big data analytics
A1: new algorithms for ubiquitous systems that are memory- and energy-aware; integer probabilistic graphical models; app prediction for energy savings.
A2: theoretical basis for memory-aware streaming clustering.
A3: methods from embedded systems for performance enhancements of R.
A4: platform for the analysis of resource consumption.
A6: social network analysis.
Topics: data stream analysis, exploiting sparseness, approximation, parallelism (GPU).

SFB 876: B projects focus on cyberphysical systems
B1: breath analysis. B2: virus detection. B3: Industrie 4.0, quality prediction, steel production. B4: traffic prognosis.
Topics: new sensors, direct control, real-time application of learned models.

SFB 876: C projects focus on very large data
C1: high-dimensional microarray data. C3: very high-frequency astrophysical data. C4: regression for large-scale high-dimensional data. C5: high-capacity particle physics (CERN).
Topics: neuroblastoma outcome prediction, feature selection, storage and filtering.