Big Data Analytics in Astrophysics Katharina Morik, Informatik 8, TU Dortmund
Overview Short introduction to the Collaborative Research Center SFB 876 Zooming into one of its projects: C3 Astrophysical Data Analysis IceCube Magic, FACT Tools for data analysis, streaming data Science today is based on data. Data analysis is intrinsically tied to the scientific process. Active Galactic Nuclei
Collaborative Research Center SFB 876: Providing Information by Resource-Constrained Data Analysis Already granted: 2011-2018
SFB 876: Cyberphysical Systems Ubiquitous systems Smart phones Multimedia Navigation Support of elder people Sensor measurements Factories Medicine Science Goals: Analysis and prediction available: Anytime, Everywhere!
SFB 876: Big Data Analytics Statistical analysis: Carefully collected data Sophisticated analysis. Data mining: Large data sets gathered for other purposes than analysis Simple, reliable models. Big data analytics: Large volume, velocity, variety data New storage and processing architectures. Goals: Scalable algorithms, Algorithms for new architectures
SFB 876: Resource-aware algorithms Cyberphysical systems produce big data. Big data analytics delivers data summaries, predictions. Foundations Beyond runtime and sample complexity! Future: memory- and energy-efficient analytics. Collaborative research center SFB 876 with 13 projects, 20 professors and about 50 PhD students.
Project leaders: Christian Sohler, theoretical CS; Sangkyun Lee, Data Analysis CS; Olaf Spinczyk, Embedded Systems CS; Alexander Schramm, Biomedicine; Christian Wietfeld, Electrical Engin.; Michael Schreckenberg, Traffic Physics; Jochen Deuse, Machining; Jian-Jia Chen, Embedded Systems CS; Kristian Kersting, Data Analysis CS; Katharina Morik, Data Analysis CS; Heinrich Müller, Visual Analytics CS; Petra Mutzel, Algorithm Engin. CS; Jörg Rahnenführer, Statistics; Peter Marwedel, Embedded Systems CS; Katja Ickstadt, Statistics; Tim Ruhe, Astrophysics; Wolfgang Rhode, Astrophysics; Bernhard Spaan, Particle physics; Jens Teubner, Databases CS; Michael Ten Hompel, Logistics
Big Data Analytics in Astrophysics High Energy Astrophysical Phenomena Neutrinos Gamma rays Instrumentation and Methods for Astrophysics Telescopes Data Analysis IceCube Collaboration Magic I, II, FACT Project C3 of SFB 876 Prof. Dr. Dr. Wolfgang Rhode Prof. Dr. Katharina Morik Dr. Tim Ruhe Big Data is Volume, Velocity, Variety. Big Data helps if rare events are outliers. Here, we are looking for the needle in the haystack: neutrinos and gamma rays are dominated by other particles. Data analysis tool RapidMiner Feature selection MRMR Streams framework
Structure of the remaining talk the data analysis cycle Data Analysis is not just one step in the overall scientific investigation. Data Analysis accompanies the scientific investigation. Data Analytics is interdisciplinary in nature: Physics Statistics Data management Machine learning Software/Algorithm engineering Complexity Theory Insight Physics questions
Data Analysis Tools - Requirements Coverage: Support the overall cycle Amenability: Easy to use for all involved scientists Rapidity: Fast creation of analysis process Comparability: international scientific terminology Reproducibility: storing the analysis process with all parameters in a small format. Not just the modeling step. Not just for statisticians, computer scientists, physicists but for all of them. Not a bunch of programs/libraries to be used in programming. Not a new name with a little variance for known methods that have a theoretic basis. Not asking the programmer whether he remembers which parameters succeeded.
RapidMiner Coverage: Support the overall cycle Amenability: Easy to use for all involved scientists Rapidity: Fast creation of analysis process Comparability: international scientific terminology Reproducibility: storing the analysis process with all parameters in a small format, re-run, exchange. If 115 data transformations and 264 modeling operators are not enough: Write your own operator and plug it into RapidMiner!
Physics Questions Explaining Birth and Death of Stars What happens when particles collide and how, exactly? High Energy Astrophysical Phenomena Neutrinos Gamma rays Instrumentation and Methods for Astrophysics Telescopes Data Analysis Insight Physics questions
Death of a star 1054: Chinese astronomers detect a supernova. Its remnant is the Crab Nebula. 1967: Jocelyn Bell and Antony Hewish discover a pulsar, which comes from a rotating neutron star. The heat of the dying star has turned electrons and protons to neutrons. This releases a flux of neutrinos. Neutron stars are composed of neutrons. Neutron stars may produce gamma rays.
Neutrino genesis $\beta^-$ decay: $n \to p + e^- + \bar{\nu}_e$ (a neutron becomes a proton, an electron and an electron antineutrino). $\beta^+$ decay: $p \to n + e^+ + \nu_e$ (a proton becomes a neutron, a positron and an electron neutrino). Neutrinos accompany the electron, the mu-meson (muon) and the tauon. A muon neutrino collides with the proton of a hydrogen atom, resulting in a positive pion and a negative muon.
Faster than light Cherenkov radiation was observed by Marie Curie and interpreted by Pavel Cherenkov. If a charged particle moves through a medium faster than light travels in that medium, the blue Cherenkov light shows up. The medium can be: Water Ice Air. CERN sent a beam of neutrinos to the OPERA experiment at the Gran Sasso laboratory. The "faster than light" neutrino discovery reported there was actually the result of a loose cable. So the scientific community can rest easy.
Breakthrough of the year 2013: IceCube The IceCube project has been awarded the 2013 Breakthrough of the Year by the British magazine Physics World. Detection of 28 neutrinos from outside the Milky Way, 2 neutrinos with 1 PeV (Ernie and Bert), one even with 2 PeV (Big Bird). Model the distribution of all neutrinos (frequency of neutrinos at all energies)! Separate neutrinos from other particles. IceCube collaboration: 36 research institutes in 8 countries, Francis Halzen, University of Wisconsin-Madison, Olga Botner, University of Uppsala (speakers) [Figure labels: interaction in detector, cosmic neutrinos, atmospheric neutrinos, IceCube, Magic, interaction in orbit, proton]
Magic and FACT Cherenkov Telescopes An air shower is visible in the camera for about 150 nanoseconds. The 1440 pixels of the camera are sampled at a rate of 2 GHz. A hardware filter decides about storage. Records of sampled voltages of camera pixels. Model the distribution of gamma rays. Separate gamma rays from other rays.
Data understanding What do the data look like? How to access the data? Do we need real-time access? Physics questions Data understanding Insight
Data volume challenging data analysis IceCube at the South Pole: within the holes, Digital Optical Modules (DOMs) are mounted that record Cherenkov light. The trace of muons through the 3D array of DOMs also shows the direction of the incoming neutrinos. 1 terabyte per day = 1,000 gigabytes. A satellite takes about 10 years to transport the data of a year to the University of Wisconsin. A ship transports the data on hard disks in 28 days. No real-time analysis. Ships are faster than satellites for data transmission.
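The comparison between satellite and ship can be checked with a little arithmetic. The following is a minimal sketch, assuming decimal units (1 TB = 10^12 bytes) and a constant data rate of 1 TB per day; the helper name `effective_rate_mb_per_s` is made up for this illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60
TB = 10**12  # bytes; decimal terabyte (assumption)


def effective_rate_mb_per_s(total_bytes, days):
    """Average throughput in MB/s when total_bytes are moved within the given number of days."""
    return total_bytes / (days * SECONDS_PER_DAY) / 10**6


yearly_bytes = 365 * TB  # one year of data at 1 TB per day

satellite = effective_rate_mb_per_s(yearly_bytes, 10 * 365)  # about 10 years by satellite
ship = effective_rate_mb_per_s(yearly_bytes, 28)             # 28 days by ship

print(f"satellite: {satellite:.2f} MB/s")
print(f"ship:      {ship:.1f} MB/s")
print(f"ship / satellite: {ship / satellite:.0f}x")
```

Under these assumptions the satellite link delivers roughly 1.2 MB/s on average, while the shipped disks correspond to roughly 150 MB/s, which is why the ship wins by two orders of magnitude.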
Skewed distribution challenging data analysis Calibration, cleaning Feature extraction Signal separation Energy estimation A simulator provides labeled observations. Gamma rays of high energy are rare events as opposed to hadrons, ratio 1 to 1000. MAGIC I (2003) and MAGIC II (2009) La Palma, Roque de los Muchachos FACT telescope, same type, same place
FACT-Viewer
Streams framework for real-time processing The streams framework offers a high-level view of streaming data (Bockermann, Blom 2012). There is a plugin for RapidMiner. It incorporates MOA for the analysis of streaming data. It compiles into Storm for execution, if wanted. The FACT-Tools are a set of Java classes that provide an extension to the streams framework for reading and processing data from the FACT telescope.
Data understanding what is to be learned? IceCube: The overall spectrum of atmospheric neutrinos from 100 GeV to 1 PeV demands better background rejection. MAGIC, FACT: Better separation of hadrons and gamma rays. Reproducible results. Insight Physics questions Data understanding Tim Ruhe, Katharina Morik, Wolfgang Rhode: Application of RapidMiner in Neutrino Astronomy, in: Hofmann, Klinkenberg (eds), RapidMiner: Data Mining Use Cases and Business Analytics Applications, CRC Press 2014. Mark Aartsen, Katharina Morik, Wolfgang Rhode, Tim Ruhe: Development of a general Analysis and Unfolding Scheme and its Application to Measure the Energy Spectrum of Atmospheric Neutrinos with IceCube, in: Eur. Phys. Journal 2014.
Data preparation training samples and features Samples Balanced stratified sample enhances learning. Training: 27 000 signal 27 000 neutrino events Labeling by simulation. IceCube 59 strings measurements of 346 days 195 321 860 raw events 16 983 100 after manual cuts No true labels given, but estimate according to distribution of neutrinos. Insight Physics questions Data preparation
Data preparation Feature selection Too many features hide the true pattern and slow down modeling. Redundant features may lead to wrong results. Wrong: Two features with the same meaning have twice the impact. Right: Each of the two features should have half the impact. The quality of a feature set is given by the performance of learning using the feature set. Insight Physics questions Data preparation
Q -- MRMR Minimum Redundancy Maximum Relevance (MRMR) Start with an empty feature set. Add the feature with the lowest redundancy to the already chosen features, D(x,x'), and the highest relevance w.r.t. the label, R(x,y): $Q = R(x, y) - \frac{1}{j} \sum_{x' \in F_j} D(x, x')$, where $F_j$ is the set of the $j$ features chosen so far. Efficient implementation by Benjamin Schowe in RapidMiner's Feature Selection Extension. Mark Hall: Correlation Based Feature Selection, 1999
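The greedy selection can be sketched in a few lines. This is an illustration only, not the RapidMiner implementation: it assumes the absolute Pearson correlation for both the relevance R and the redundancy D, and the feature names and toy values are made up.

```python
from statistics import mean


def abs_corr(a, b):
    """Absolute Pearson correlation of two equally long numeric sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return abs(cov) / (var_a * var_b) ** 0.5


def mrmr(features, label, k):
    """Greedy MRMR. features maps a name to a column; returns k names in selection order."""
    relevance = {name: abs_corr(col, label) for name, col in features.items()}
    selected = []
    while len(selected) < min(k, len(features)):
        best_name, best_q = None, float("-inf")
        for name, col in features.items():
            if name in selected:
                continue
            # Mean redundancy to the features chosen so far (0 for the first pick).
            redundancy = (
                mean(abs_corr(col, features[s]) for s in selected) if selected else 0.0
            )
            q = relevance[name] - redundancy
            if q > best_q:
                best_name, best_q = name, q
        selected.append(best_name)
    return selected


# Toy data: the two zenith features carry the same information.
label = [0, 0, 0, 1, 1, 1, 0, 1]
features = {
    "zenith_a": [0.1, 0.2, 0.1, 0.9, 0.8, 0.9, 0.2, 0.8],
    "zenith_b": [0.2, 0.4, 0.2, 1.8, 1.6, 1.8, 0.4, 1.6],  # 2 * zenith_a
    "charge": [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0],
}
selected = mrmr(features, label, 2)
print(selected)
```

In this toy example one of the two zenith features is picked first because it is highly relevant. The second pick is `charge`: the remaining zenith feature is just as relevant but almost completely redundant, so its Q drops below that of the weaker but independent feature.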
Feature selection in RapidMiner
Overfitting to a sample? Measuring stability! Split the training data into m sets. Do feature selection on m-1 sets. Use the selected features for learning and test the learner's performance. We do that m times. One sample leads to the feature set A, another to the feature set B; both select k features out of the n given ones. Jaccard index: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$
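A minimal sketch of the stability measure: the Jaccard index of two feature sets, and, as an assumption of this sketch, the average over all pairs of runs as the overall stability score. The feature names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two feature sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty selections are treated as identical
    return len(a & b) / len(a | b)


def stability(feature_sets):
    """Average pairwise Jaccard index over the feature sets of all runs."""
    pairs = list(combinations(feature_sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Feature sets selected in m = 3 runs (hypothetical names).
runs = [
    {"zenith", "energy", "n_hits"},
    {"zenith", "energy", "track_length"},
    {"zenith", "energy", "n_hits"},
]
print(jaccard(runs[0], runs[1]))
print(stability(runs))
```

The first two runs share 2 of 4 distinct features, so their Jaccard index is 0.5; a stability close to 1 means the selection hardly depends on the sample.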
Feature transformation and extraction Right representation eases understanding and modeling, e.g., time series: deterministic, with outlier, level change. [Figure: each series plotted as y_t over time t and as y_{t+1} over y_t; one panel labeled HR_t. Source: U. Gather, M. Bauer]
Feature selection and extraction in IceCube 200 features given, diverse constructions over raw data. Feature selection using MRMR is particularly successful: there are redundant features in the data, e.g., the zenith angle or its consequence is obtained from several reconstruction algorithms. 25 features selected with stability over 0.8. Creating features by binning (spectrum unfolding) and learning: Tim Ruhe's method. Novel spectrum unfolding: Partition attribute values into bins (intervals). Intervals become labeled classes. Run multiclass learning; output the confidence of each class for each example (table: rows ex 1 ... ex n; columns bin 1 ... bin r and Neutrino?; entries conf). Learn a classifier from these data.
Binning in RapidMiner Numerical data Discretized into 4 bins Discretized to equal frequency bins
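The two discretization variants differ in how the bin borders are chosen. The sketch below is plain Python rather than the RapidMiner operators; function names and the toy energies are made up for illustration.

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 using intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(values)
    # The maximum falls exactly on the upper border, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values, n_bins):
    """Assign bin indices so that every bin receives about the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


energies = [1.0, 1.2, 1.5, 2.0, 2.2, 3.0, 10.0, 100.0]
width_bins = equal_width_bins(energies, 4)
freq_bins = equal_frequency_bins(energies, 4)
print(width_bins)
print(freq_bins)
```

On skewed data such as these toy energies, equal-width binning puts almost everything into the first bin, while equal-frequency binning yields two values per bin. Note that this simple rank-based variant may separate tied values into different bins.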
Modeling tasks classification and regression Given observations x with labels y, $\{(x_1, y_1), \dots, (x_n, y_n)\}$, with binominal y (classification) or with real-valued y (regression). Find f with f(x) = y such that the error is minimized. RapidMiner offers 167 learning algorithms for classification and regression. IceCube: y in {neutrino, not neutrino} Algorithm should be robust, scalable, parallel! Insight Physics questions Modeling
Types of Models Lazy Modeling, Local Models k-Nearest Neighbors Additive Models Decision Trees Linear Models Linear Regression Support Vector Machine Bayesian Models
Ensemble Methods here: for decision trees Ensemble: Take many models and decide according to the majority vote! Algorithm Training: For l decision trees (parallel): (1) Take a sample from the data (2) Take a subset of features (3) Choose the best feature according to minimal entropy Split according to feature If purity not ok, goto (3). Weka Random Forest in RapidMiner parallel Breiman 2001, Machine Learning Journal
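The training steps above can be sketched as follows. This is a small illustration in plain Python, not the Weka Random Forest: it follows the slide in drawing one feature subset per tree (Breiman's Random Forest draws a new subset at every split), grows the trees sequentially rather than in parallel, and uses made-up toy data.

```python
import math
import random
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def best_split(rows, labels, feature_ids):
    """Return the (feature, threshold) with minimal weighted entropy, or None."""
    best, best_score = None, entropy(labels)
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def grow_tree(rows, labels, feature_ids):
    split = best_split(rows, labels, feature_ids)
    if split is None:  # pure node or no improving split: predict the majority label
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            grow_tree([r for r, _ in left], [y for _, y in left], feature_ids),
            grow_tree([r for r, _ in right], [y for _, y in right], feature_ids))


def predict_tree(tree, row):
    while isinstance(tree, tuple):  # inner nodes are tuples, leaves are labels
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


def train_forest(rows, labels, n_trees, n_features, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]       # (1) sample with replacement
        feats = rng.sample(range(len(rows[0])), n_features)  # (2) subset of features
        forest.append(grow_tree([rows[i] for i in idx],      # (3) split by minimal entropy
                                [labels[i] for i in idx], feats))
    return forest


def predict_forest(forest, row):
    """Majority vote over all trees."""
    return Counter(predict_tree(t, row) for t in forest).most_common(1)[0][0]


rows = [(0.1, 1.0), (0.2, 1.5), (0.3, 1.2), (0.4, 1.8),
        (0.6, 3.1), (0.7, 3.5), (0.8, 3.9), (0.9, 3.3)]
labels = ["background"] * 4 + ["neutrino"] * 4
forest = train_forest(rows, labels, n_trees=25, n_features=1, seed=1)
print(predict_forest(forest, (0.05, 0.5)), predict_forest(forest, (0.95, 4.5)))
```

Because every tree sees a different sample and feature subset, single trees may be poor, but the majority vote is robust; the trees are independent of each other, which is what makes the training easy to parallelize.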
Evaluation Training and testing on different samples is necessary in order to estimate the true error. Testing on the same set would overestimate the quality of the model. Best test: leave one out requires for n observations n runs of modeling. This is not efficient. Crossvalidation: split into m partitions, use m-1 for training and test on the unused one. Output the average of all tests. Fair. Insight Physics questions Evaluation
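Cross-validation can be sketched independently of the learner. In the following illustration the learner (`fit`, `predict`) is a deliberately simple, hypothetical threshold classifier, and the data are made up; setting m to the number of observations gives leave-one-out.

```python
def k_fold_indices(n, m):
    """Split the indices 0..n-1 into m disjoint test folds of nearly equal size."""
    return [list(range(i, n, m)) for i in range(m)]


def cross_validate(xs, ys, m, fit, predict):
    """Train on m-1 folds, test on the unused one, and return the average test error."""
    errors = []
    for test_idx in k_fold_indices(len(xs), m):
        test = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in test]
        train_y = [y for i, y in enumerate(ys) if i not in test]
        model = fit(train_x, train_y)
        wrong = sum(predict(model, xs[i]) != ys[i] for i in test_idx)
        errors.append(wrong / len(test_idx))
    return sum(errors) / len(errors)


# Hypothetical toy learner: threshold halfway between the two class means.
def fit(xs, ys):
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict(threshold, x):
    return 1 if x > threshold else 0


xs = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.15]
ys = [0, 1, 0, 1, 0, 1, 0, 1]  # the last observation is deliberately noisy
cv_error = cross_validate(xs, ys, 4, fit, predict)
print(cv_error)
```

Every observation is used for testing exactly once and never by a model that was trained on it, so the noisy observation shows up as a test error instead of being hidden by training on it.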
Cross validation in RapidMiner "Your measurement is meaningless without knowledge of the error." Walter Lewin
IceCube Results We enhanced the neutrino recognition by 62% (IceCube collaboration and Morik 2014). Quality cuts lead from the full data D to D', rejecting the easy 91.4% of background. Random Forest leads from D' to D'' so that 99.9999% of background muons are rejected. At this background rejection, 27 771 atmospheric neutrino events were detected in 346 days of IceCube 59. Insight Physics questions
IceCube Results For the first time, the spectrum of atmospheric neutrinos could be reconstructed up to an energy of 1 PeV. Now, theoretical astrophysics can build upon this. Physics questions Insight
Empirical work for theory development [Figure: atmospheric neutrino energy spectra, $E_\nu^2 \Phi_\nu$ in GeV cm$^{-2}$ s$^{-1}$ sr$^{-1}$ over $\log_{10}(E_\nu\,[\mathrm{GeV}])$. Shown: IC-59 $\nu_\mu$ unfolding, AMANDA $\nu_\mu$ unfolding, IC-79 $\nu_e$ flux, Gaisser H3a, Gaisser H3a no charm, Frejus $\nu_\mu$, Frejus $\nu_e$, Frejus $\nu_\mu$ model, Frejus $\nu_e$ model, Honda $\nu_e$] Insight Physics questions Evaluation
Interdisciplinary work
Throughput Performance of FACT Tools FACT records 60 events per second. Each event amounts to 3 megabytes of raw data. 180 MB/second are to be processed! The average processing time in milliseconds, shown on a log scale, covers the overall process ending with a classifier application.
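The required throughput follows directly from the event rate and the event size. The short calculation below also derives a per-event time budget, which assumes, purely for illustration, that a single sequential pipeline has to keep up with the telescope.

```python
events_per_second = 60
megabytes_per_event = 3

# Data rate the processing chain has to sustain.
throughput_mb_s = events_per_second * megabytes_per_event

# Time available per event if events are processed one after another (assumption).
budget_ms = 1000 / events_per_second

print(f"required throughput: {throughput_mb_s} MB/s")
print(f"time budget per event: {budget_ms:.1f} ms")
```

If the summed average processing times of all steps stay below this budget of about 17 ms, one sequential process is enough; otherwise the steps have to be distributed over several workers.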
Conclusion Data Analysis is needed in order to make good use of big data (in physics). We have seen the streams framework that processes telescope data in real-time. Choosing the right representation is the key to excellent results. We have seen a stable MRMR feature selection. We have seen a learning process producing features for learning unfolding. RapidMiner supports the overall cycle Select a method by a click! Store the process for documentation and exchange.
Further work The unfolding method could use other binning methods: Reduce entropy within a bin. Tauon neutrinos are hard to recognize, but we'll try! Real-time analysis could support the array of Cherenkov telescopes! S. G. Djorgovski (Caltech, USA): Astronomy has become an immensely data-rich field. There is a need for powerful DM/KDD tools.
SFB 876: A projects combine cyberphysical systems with big data analytics A1: new algorithms for ubiquitous systems, which are memory- and energy-aware, Integer probabilistic graphical models App prediction for energy savings A2: theoretical basis for memory-aware streaming clustering A3: methods from embedded systems for performance enhancements of R A4: platform for the analysis of resource consumption A6: Social network analysis Data stream analysis Exploiting sparseness Approximation Parallelism (GPU)
SFB 876: B projects focus on cyberphysical systems B1 Breath analysis B2 Virus detection B3 Industrie 4.0 Quality prediction Steel production B4 Traffic prognosis New Sensors Direct control Real-time application of learned models
SFB 876: C projects focus on very large data C1 High-dimensional microarray data C3 Very high-frequency astrophysical data C4 Regression for large-scale high-dimensional data C5 High-capacity particle physics (CERN) Neuroblastoma outcome prediction Feature selection Storage and filtering