Statistical Distributions in Astronomy

Similar documents
Foundation of Quantitative Data Analysis

Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.

AMS 5 CHANCE VARIABILITY

Data Exploration Data Visualization

KNIME TUTORIAL. Anna Monreale KDD-Lab, University of Pisa

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Data Provided: A formula sheet and table of physical constants is attached to this paper. DARK MATTER AND THE UNIVERSE

Statistics I for QBIC. Contents and Objectives. Chapters 1 7. Revised: August 2013

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Azure Machine Learning, SQL Data Mining and R

Data Mining and Visualization

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Exploratory Data Analysis

Curriculum Map Statistics and Probability Honors (348) Saugus High School Saugus Public Schools

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses.

RESULTS FROM A SIMPLE INFRARED CLOUD DETECTOR

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler

5. The Nature of Light. Does Light Travel Infinitely Fast? EMR Travels At Finite Speed. EMR: Electric & Magnetic Waves

COM CO P 5318 Da t Da a t Explora Explor t a ion and Analysis y Chapte Chapt r e 3

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Engineering Problem Solving and Excel. EGN 1006 Introduction to Engineering

7. Normal Distributions

Take away concepts. What is Energy? Solar Energy. EM Radiation. Properties of waves. Solar Radiation Emission and Absorption

4.1 Exploratory Analysis: Once the data is collected and entered, the first question is: "What do the data look like?"

Density Curve. A density curve is the graph of a continuous probability distribution. It must satisfy the following properties:

Statistics, Data Mining and Machine Learning in Astronomy: A Practical Python Guide for the Analysis of Survey Data. and Alex Gray

Northumberland Knowledge

Iris Sample Data Set. Basic Visualization Techniques: Charts, Graphs and Maps. Summary Statistics. Frequency and Mode

Lecture 8: Signal Detection and Noise Assumption

MTH 140 Statistics Videos

Preview of Period 3: Electromagnetic Waves Radiant Energy II

Fairfield Public Schools

Unit 1: Introduction to Quality Management

Origins of the Cosmos Summer Pre-course assessment

John Kerrich s coin-tossing Experiment. Law of Averages - pg. 294 Moore s Text

CHI-SQUARE: TESTING FOR GOODNESS OF FIT

Probability and Statistics Prof. Dr. Somesh Kumar Department of Mathematics Indian Institute of Technology, Kharagpur

STATS8: Introduction to Biostatistics. Data Exploration. Babak Shahbaba Department of Statistics, UCI

Week 1. Exploratory Data Analysis

Blackbody Radiation References INTRODUCTION

Implications of Big Data for Statistics Instruction 17 Nov 2013

Didacticiel Études de cas

BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I

Content Sheet 7-1: Overview of Quality Control for Quantitative Tests

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number

Univariate Regression

Light as a Wave. The Nature of Light. EM Radiation Spectrum. EM Radiation Spectrum. Electromagnetic Radiation

Grade 6 Standard 3 Unit Test A Astronomy. 1. The four inner planets are rocky and small. Which description best fits the next four outer planets?

Dealing with large datasets

MBA 611 STATISTICS AND QUANTITATIVE METHODS

Predicting Flight Delays

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics

Ellipticals. Elliptical galaxies: Elliptical galaxies: Some ellipticals are not so simple M89 E0

COOKBOOK. for. Aristarchos Transient Spectrometer (ATS)

Descriptive Statistics

Basic Probability and Statistics Review. Six Sigma Black Belt Primer

Homework #4 Solutions ASTR100: Introduction to Astronomy Fall 2009: Dr. Stacy McGaugh

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Exploratory data analysis (Chapter 2) Fall 2011

GeoGebra Statistics and Probability

Statistics Review PSY379

The Sun. Solar radiation (Sun Earth-Relationships) The Sun. The Sun. Our Sun

Physics 1010: The Physics of Everyday Life. TODAY Black Body Radiation, Greenhouse Effect

Light. What is light?

AMS 7L LAB #2 Spring, Exploratory Data Analysis

Some Essential Statistics The Lure of Statistics

Mathematics within the Psychology Curriculum

Faber-Jackson relation: Fundamental Plane: Faber-Jackson Relation

MEASURES OF VARIATION

Solar Energy. Outline. Solar radiation. What is light?-- Electromagnetic Radiation. Light - Electromagnetic wave spectrum. Electromagnetic Radiation

Nuclear Physics Lab I: Geiger-Müller Counter and Nuclear Counting Statistics

A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution

For a partition B 1,..., B n, where B i B j = for i. A = (A B 1 ) (A B 2 ),..., (A B n ) and thus. P (A) = P (A B i ) = P (A B i )P (B i )

Probability and Statistics Vocabulary List (Definitions for Middle School Teachers)

Module 4: Data Exploration

6.4 Normal Distribution

Lecture 1: Review and Exploratory Data Analysis (EDA)

UNIT 1: COLLECTING DATA

Demonstration of Data Analysis using the Gnumeric Spreadsheet Solver to Estimate the Period for Solar Rotation

Virtual Met Mast verification report:

Anomaly Detection and Predictive Maintenance

product overview pco.edge family the most versatile scmos camera portfolio on the market pioneer in scmos image sensor technology

Probability and Statistics

BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS

Dongfeng Li. Autumn 2010

The right edge of the box is the third quartile, Q 3, which is the median of the data values above the median. Maximum Median

430 Statistics and Financial Mathematics for Business

EXPLORING SPATIAL PATTERNS IN YOUR DATA

R Tools Evaluation. A review by Global BI / Local & Regional Capabilities. Telefónica CCDO May 2015

CA200 Quantitative Analysis for Business Decisions. File name: CA200_Section_04A_StatisticsIntroduction

SSO Transmission Grating Spectrograph (TGS) User s Guide

LAGUARDIA COMMUNITY COLLEGE CITY UNIVERSITY OF NEW YORK DEPARTMENT OF MATHEMATICS, ENGINEERING, AND COMPUTER SCIENCE

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data

Lecture 2: Descriptive Statistics and Exploratory Data Analysis

Science Standard 4 Earth in Space Grade Level Expectations

Transcription:

Data Mining In Modern Astronomy Sky Surveys: Statistical Distributions in Astronomy Ching-Wa Yip cwyip@pha.jhu.edu; Bloomberg 518

From Data to Information We don t just want data. We want information from the data. Information Database Sensors Data Analysis or Data Mining

Why Do We Use Statistics to Analyze Data? Describe the data: What is the mean, median, mode? What is the standard deviation? What is the distribution? What are the outliers? Find trends in the data: Is there a correlation between two variables? How well are they correlated? What is the predicted value of a variable?

R A free statistical software. Many packages available for performing from basic to advanced statistical analyses. Popular in multiple disciplines. Popular in industry.

Interactive Data Language (IDL) A programming language for data analysis and plotting. Many Procedures for manipulating FITS file. Popular among astronomers. (Age map of a galaxy.)

Other Programming Languages & Resources Python (free; getting popular among astronomers) Matlab C/C++/C# Java Numerical Recipes (William Press et al.) Performs many numerical calculations Can be used with different programming languages LAPACK Performs algebraic and matrix calculations Can be used with different programming languages

Future: Data Analysis using Database Automated data analysis: Select data from DB using C# routines with SQL scripts embedded Perform computations Output results to DB, if necessary (MS SQL Server. Source: Alex Szalay)

Basic Description of the Data: Mean, Median, Mode Mean = average value Median = middle value Mode = most common value E.g. 10 sampled values = 3.4 4.8 8.4 9.6 2.3 9.6 5.6 9.6 4.8 2.2 (3) (4) (7) (8) (2) (9) (6) (10) (5) (1) n = 10 Mean = Sum of values / n = 6.03 Median = (4.8 + 5.6) / 2 Mode =

Basic Description of the Data: Mean, Median, Mode Mean = average value Median = middle value Mode = most common value E.g. 10 sampled values = 3.4 4.8 8.4 9.6 2.3 9.6 5.6 9.6 4.8 2.2 (3) (4) (7) (8) (2) (9) (6) (10) (5) (1) n = 10 Mean = Sum of values / n = 6.03 Median = (4.8 + 5.6) / 2 Mode = 9.6

Width of a Distribution: Standard Deviation (or SD, σ ) σ Width (+299,000 km/s)

Shape of a Distribution In general, sampled values may not form a well-quantified distribution (or, not analytical). We can use the binned histogram to report the shape of the distribution. Advanced: We can fit a linear combination of multiple functions.

Basic Statistics in R

Key Concepts in Statistics Variables Population Distribution vs. Sampling Distribution Central Limit Theorem (CLT)

A Big Bag of Marbles Suppose we want to measure the average and variance of the weight of a big bag of marbles. How to do it? We draw a sample from the population.

Random Variable (X) X s are Independent: the outcome of any one experiment does not influence the outcomes of others (random draws). X s are Identically Distributed: every X is drawn from the same distribution (same bag of marbles). X 1 X 2 X 3 X 4 X n-1 X n E.g., 1.1g 0.8g 1.0g 0.9g 0.7g 1.2g

Random Variable (X) X s are Independent: the outcome of any one experiment does not influence the outcomes of others (random draws). X s are Identically Distributed: every X is drawn from the same distribution (same bag of marbles). X 1 X 2 X 3 X 4 X n-1 X n E.g., 1.1g 0.8g 1.0g 0.9g 0.7g 1.2g IID Variable

Types of Random Variables Discrete variables E.g., Photon Count: 10, 1, 6, 2, 14, 3, 5, 7, E.g., Binary States: 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, Continuous variables E.g., Luminosity (in units of solar luminosity): 0.14, 2.46, 1.57, 4.52,

Height of Trees in Forest Sample Vs. Population

Color of Galaxies in Universe Credit: Miguel Angel Aragon Calvo

Statistical Distributions There are many statistical distributions available to describe experimental results. The most common ones in Astronomy are: Gaussian Distribution Poisson Distribution Planck Distribution

Gaussian Distribution (or Bell Curve, Normal Distribution ) (Carl Friedrich Gauss, 1777-1855)

Uncertainty in Physical Measurements For many experiments and observations concerning physical phenomena, we find that performing the procedure twice under (what seem!!!) identical conditions results in two different outcomes. Such kind of outcomes follows a Gaussian Distribution. For example: Michelson and Morley s speed-of-light experiment.

(+299,000 km/s) (Credit: Wikipedia)

Random Error (+299,000 km/s) (Credit: Wikipedia)

Main Lesson Michelson and Morley repeated the experiment many times to estimate the speed-of-light. They only knew the mean and SD of the population (X i ) exist, but they did not know of the values. It works because of the Central Limit Theorem. *Read first two paragraphs of the Handout.

Poisson Distribution The probability of the number of events (occurring in a fixed period of time) given a known, expected count. Probability (Siméon Poisson, 1781-1840) k (= number of events)

Probability Examples of Poisson Distributions I know from experience that I get 4 phone calls a day (λ = 4). Then there is about 20% chance I will get 3 phone calls today. Number of events

Probability Examples of Poisson Distributions I know from experience that I get 4 photons a day (λ = 4). Then there is about 20% chance I will get 3 photons today. Number of events

Probability: Chance of Occurrence Discrete Variable (e.g., 1, 2, 3, 4, 5, 6) Continuous Variable E.g., Experimental results follow a Gaussian (also called Normal) X Probability of getting X between two values is the area under the Standard Normal Distribution between the two said values. A Standard Normal Distribution is a Normal Distribution with total area = 1.

Random Number: Gaussian Distribution as a Case Study When we analyze big datasets, sampling methods are commonly used. Random number plays an important role in sampling strategies. T 3 T 6 T 5 T 1 T 4 T 2 T 7 Suppose we have a piece of metal with temperature measurement T(x, y). We want to estimate the average temperature without looking at all data (i.e., the population).

Distribution of 10 Random Numbers from Normal Distribution

Distribution of 100 Random Numbers from Normal Distribution

Distributions in Astronomy Gaussian Distribution Poisson Distribution Planck Distribution Etc.

Emission Lines in Galaxies: Gaussian Fitting single or multiple Gaussian functions to the line profiles in a galaxy spectrum.

Frequency Colors of Galaxies in Nearby Universe: Multiple Gaussians Color (Redder ) (Yip, Connolly, Szalay, et al. 2004)

Seeing Disk of Stars: Gaussian

Cause of Seeing Disk of Stars: Atmospheric Disturbance

Photon Counts in Astronomical Images: Poisson Distribution Signal-to-Noise Ratio (SNR) = Count/Random Error of Count = Count (for Poisson) That is, when Count increases, SNR increases.

Cosmic Microwave Background: Planck Distribution Universe expands and cools down. The photons were hot, but cold now (3 Kelvin).

Black Body Radiation from the Sun: Planck Distribution (Max Planck, 1858 1947) Why is the Sun not a perfect Black Body (or Planck Distribution)?