Big Data for Science (and other) Students
Randall Pruim, Calvin College

Questions to My Colleagues
1. What is the largest data set that students encounter in your respective majors?
2. In the context of your disciplines, when someone says "big data," what do you think of? How big? What application areas?

Me: How big is your data? A Chemist Responds
Chemist: The biggest datasets I use are in my research: about 500 by 100 absorbance values. The biggest I know of in chemistry would probably be chromatography mass spec data, which can be many thousands of mass spec scans, each of which has 10 to 100 thousand data values. Another possibility would be 2D NMR, but I am not sure how big those datasets are.
Me: Follow-up question: What is the biggest data students see in your classes?
Chemist: Nada, really.

Me: How big is your data? Another Chemist Responds
Chemist: I think this is related to one reason why there's more traction for statistics in the curriculum in biology and medicine than in chemistry. I don't think I've ever come across big data as a chemistry student or as a professor in the chemistry that I teach (General and Physical). Certainly we do sometimes have instruments that generate big data files, but that's just because we have a spectrometer that measures absorbance for each tenth of a nanometer for an interval of 500 nanometers or we have a probe that measures temperature twice per second for an hour. I don't think these are the sort of data that you have in mind when you say "big data." I think the closest that we get to big data [as chemists] is in biochem and bioinformatics.

Me: How big is your data? A Physicist Responds
Physicist: By modern standards, this is not big data, but nowadays I use oscilloscopes which can return 5 columns, each of 500,000 lines. More typical would be 2-3 columns and 16k lines. Either way, the data sets are too big to cut and paste, so I've actually learned how to read such files into Sage from a computer desktop. Also, I think you should ask [our astronomer], who uses asteroid databases with up to 500k objects, giving > 5 parameters for each one.

Main Stories:
- There is an enormous difference in scale between classes, student research, and disciplinary biggies.
- Science faculty are only vaguely familiar with really large data sets.
- Many faculty never work with anything but very modestly sized data.
- Some angst about the approaching big data train.

Big is when your workflow breaks
Physicist: I know my workflow changed when data sets of > 64k could no longer use (former versions of) Excel as a place to copy and paste and then edit.
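To make the "workflow breaks" threshold concrete, here is a minimal Python sketch that compares the data set sizes mentioned by the colleagues above with the 65,536-row limit of the legacy Excel worksheet format (Excel 2003 and earlier). The row counts are read off the quotes. How each data set maps onto spreadsheet rows, the reading of "16k" as 16,000, and the names `DATASETS`, `breaks_workflow`, and `too_big_for_copy_paste` are assumptions made for illustration.

```python
# Row limit of the legacy .xls worksheet format (Excel 2003 and earlier).
LEGACY_EXCEL_MAX_ROWS = 65_536

# Row counts taken from the colleagues' answers. How each data set maps
# onto spreadsheet rows is an assumption made for this illustration.
DATASETS = {
    "absorbance table (500 by 100)": 500,
    "spectrometer scan (500 nm at 0.1 nm steps)": 500 * 10,
    "temperature probe (2 readings/s for 1 hour)": 2 * 60 * 60,
    "typical oscilloscope trace (16k lines)": 16_000,
    "large oscilloscope trace": 500_000,
    "asteroid database": 500_000,
}


def breaks_workflow(rows, limit=LEGACY_EXCEL_MAX_ROWS):
    """Return True if the data set has more rows than one sheet can hold."""
    return rows > limit


def too_big_for_copy_paste(datasets=DATASETS):
    """Return the names (sorted) of data sets that exceed the row limit."""
    return sorted(name for name, rows in datasets.items() if breaks_workflow(rows))


if __name__ == "__main__":
    for name, rows in DATASETS.items():
        verdict = "breaks" if breaks_workflow(rows) else "fits"
        print(f"{name}: {rows:,} rows -> {verdict}")
```

Under these assumptions only the large oscilloscope trace and the asteroid database cross the limit, which matches the physicist's experience that the classroom-scale instruments stay comfortably inside a spreadsheet.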

Peering Around the Bend

Is There a Light at the End of the Tunnel?

Is Big Data Primarily an HR Problem?

Or is it an IT problem?

Harnessing the Deluge

The CVC
A group of liberal arts colleges has formed a Computation and Visualization Consortium (CVC) to address issues of curriculum and faculty development.
- Faculty from Math, Stat, Bio, Chem, CS involved
- 2 months into 4-year plan
- Attempting to identify key skills and ways to teach them
- Faculty development already identified as a key component

Some CVC First Steps
- At St. Olaf, Intro Programming is being taught using Python and R and focusing on data-related programming tasks.
- At Macalester, all science students will take a 1-hour DCF course.
- At Smith, a data science course is being introduced this fall.
- At Calvin, an NSF grant is funding a redesign of biology laboratories that make more substantial use of chemistry, mathematics, and data analysis, and physics classes are using Sage/Python.
- Other institutions are still in planning phases.
- Project MOSAIC is working to make using R easier earlier in the curriculum.
- Institutions with cross-departmental effort are progressing most quickly.

DCF at Macalester
Data and Computation Fundamentals: a 1-hour course for all science students, taught in 7 weekly sessions during the first year.
Example skills:
- RStudio and reproducible analysis (RMarkdown)
- Data: curation; import; tidying and cleaning
- Graphical (and numerical) summaries of data
- Split/Apply/Combine
- Database light (join, merge, groupby)
- Modeling/Fitting with functions and smoothers
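Two of the skills listed above, Split/Apply/Combine and a "database light" join, can be sketched in a few lines. The course itself uses R and RStudio; this sketch uses plain Python instead, and the toy records, the lookup table, and the helper names `split_apply_combine` and `left_join` are all hypothetical, not taken from the course materials.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records; not data from the course.
measurements = [
    {"site": "A", "temp_c": 20.0},
    {"site": "A", "temp_c": 22.0},
    {"site": "B", "temp_c": 15.0},
    {"site": "B", "temp_c": 17.0},
    {"site": "C", "temp_c": 30.0},
]

# Hypothetical lookup table used for the join; site "C" is deliberately missing.
sites = {"A": "forest", "B": "prairie"}


def split_apply_combine(rows, key, value, func):
    """Group rows by `key`, apply `func` to each group's `value` list,
    and combine the results into one dictionary."""
    groups = defaultdict(list)
    for row in rows:                                  # split
        groups[row[key]].append(row[value])
    return {k: func(v) for k, v in groups.items()}    # apply + combine


def left_join(rows, lookup, key, new_field):
    """Attach lookup[row[key]] to every row; unmatched keys get None."""
    return [{**row, new_field: lookup.get(row[key])} for row in rows]


mean_temp = split_apply_combine(measurements, "site", "temp_c", mean)
joined = left_join(measurements, sites, "site", "habitat")

if __name__ == "__main__":
    print(mean_temp)
    print(joined)
```

The unmatched site in the join is intentional: noticing and handling rows that fail to match is part of the "tidying and cleaning" skill in the same list.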

Ordway Bird Data
DCF at Macalester: Week 2 Example
- Data on 7000 birds (weight, time of capture, species, sex, etc.)
- Data cleaning required
Example tasks (all answered with plots):
1. How many total birds per month?
2. How does the weight differ by species, wing chord and tail length? Make a scatter plot of mean weight by mean wing chord for each species, where the color is tail length and the diameter is the standard deviation of weight. Make a similar scatter plot of the individual birds (leaving off sd). Compare with the previous plot.
3. Any trends over hour of the day?
4. Within species, does the mixture of sexes depend on the time of year?
5. How would you identify a migratory species?
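The numerical summaries behind tasks 1 and 2 might be computed as follows before any plotting. This is only a sketch in plain Python: the four records, the field names (`species`, `captured`, `weight_g`, `wing_mm`, `tail_mm`), and the function names are invented for illustration and are not the Ordway data or the course's own code.

```python
from collections import Counter, defaultdict
from datetime import date
from statistics import mean, stdev

# Hypothetical records; field names and values are invented for illustration.
birds = [
    {"species": "chickadee", "captured": date(2010, 5, 3),
     "weight_g": 11.0, "wing_mm": 64.0, "tail_mm": 60.0},
    {"species": "chickadee", "captured": date(2010, 5, 20),
     "weight_g": 12.0, "wing_mm": 66.0, "tail_mm": 62.0},
    {"species": "robin", "captured": date(2010, 5, 21),
     "weight_g": 76.0, "wing_mm": 128.0, "tail_mm": 95.0},
    {"species": "robin", "captured": date(2010, 9, 14),
     "weight_g": 80.0, "wing_mm": 132.0, "tail_mm": 97.0},
]


def birds_per_month(records):
    """Task 1: count captures by calendar month (1 = January)."""
    return Counter(r["captured"].month for r in records)


def species_summary(records):
    """Task 2: per-species values needed for the scatter plot, namely
    mean weight (y), mean wing chord (x), mean tail length (color),
    and the standard deviation of weight (point diameter)."""
    by_species = defaultdict(list)
    for r in records:
        by_species[r["species"]].append(r)

    summary = {}
    for species, rows in by_species.items():
        weights = [r["weight_g"] for r in rows]
        summary[species] = {
            "mean_weight": mean(weights),
            "mean_wing": mean(r["wing_mm"] for r in rows),
            "mean_tail": mean(r["tail_mm"] for r in rows),
            # stdev needs at least two values; fall back to 0.0 otherwise.
            "sd_weight": stdev(weights) if len(weights) > 1 else 0.0,
        }
    return summary


if __name__ == "__main__":
    print(birds_per_month(birds))
    print(species_summary(birds))
```

Each entry of the summary corresponds to one point in the species-level scatter plot. The plot of individual birds would use the raw records directly, which is why the task says to leave off the standard deviation there.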

Some Questions
1. What is big data? Size? Structure? Hygiene? Workflow?
2. What are the key (big) data skills? Programming? Databases? Concepts?
3. Who needs (big) data skills? Science majors? Stat majors? CS majors? Special subsets?
4. When/how will these students get these skills? Special courses/programs? Thread through all courses?
5. Who will take the lead on Big Data Education? Statisticians? Computer Scientists? Natural Scientists?

"Our ability to statistically analyze data has grown significantly with the maturing of computer hardware and software. However, the evolution of our statistics capabilities has taken place without a corresponding evolution in the curriculum for the undergraduate chemistry major. Most faculty understands the need for a statistical educational component, but there is little consensus as to the exact nature of what is to be taught and who should teach it. Because of the large number of courses required for the undergraduate chemistry major, it seems unlikely that requiring a course on statistics will be practical at most institutions. Additionally, it is unlikely that the typical high school education will address the needed statistics or the software training to prepare students for the chemistry courses. Therefore, the chemistry faculty must teach the statistics needed by the majors. The faculty needs to focus on statistics useful to the chemist and this is distinctly different than what is often encountered in biology, medicine, psychology, and business. A starting point is suggested for a discussion on a statistics curriculum that addresses the needs of the chemistry majors."
Nicholas E. Schlotter, "A Statistics Curriculum for the Undergraduate Chemistry Major," Journal of Chemical Education 2013, 90 (1), 51-55.

Some Questions
1. What is big data? Size? Structure? Hygiene?
2. What are the key (big) data skills? Programming? Databases? Concepts?
3. Who needs (big) data skills? Science majors? Stat majors? Special subsets?
4. When/how will these students get these skills? Special courses/programs? Thread through all courses? Taught in subject area or by external specialists?
5. Who will take the lead on Big Data Education? Statisticians? Computer Scientists? Natural Scientists?
6. Must we walk before we run? Can/should we get to the good/big stuff right away?

Outline: Where are we now? | What is Big Data? | Some First Steps | Questions

Thanks
All of this is a work in progress that would not be as far as it is and will not get as far as it can go without the help of others.
Co-conspirators:
- Danny Kaplan (Macalester College)
- Libby Shoop (Macalester College)
- Nick Horton (Amherst College)
- The Computation and Visualization Consortium
- My science colleagues at Calvin
- The team at RStudio