Big Data for Science (and other) Students
Randall Pruim
Calvin College
Questions to My Colleagues
1. What is the largest data set that students encounter in your respective majors?
2. In the context of your disciplines, when someone says "big data," what do you think of? How big? What application areas?
Me: How big is your data? A Chemist Responds
Chemist: The biggest datasets I use are in my research: about 500 by 100 absorbance values. The biggest I know of in chemistry would probably be chromatography mass spec data, which can be many thousands of mass spec scans, each with 10 to 100 thousand data values. Another possibility would be 2D NMR, but I am not sure how big those datasets are.
Me: Follow-up question: What is the biggest data students see in your classes?
Chemist: Nada, really.
Me: How big is your data? Another Chemist Responds
Chemist: I think this is related to one reason why there's more traction for statistics in the curriculum in biology and medicine than in chemistry. I don't think I've ever come across big data as a chemistry student or as a professor in the chemistry that I teach (General and Physical). Certainly we do sometimes have instruments that generate big data files, but that's just because we have a spectrometer that measures absorbance for each tenth of a nanometer for an interval of 500 nanometers or we have a probe that measures temperature twice per second for an hour. I don't think these are the sort of data that you have in mind when you say "big data." I think the closest that we get to big data [as chemists] is in biochem and bioinformatics.
Me: How big is your data? A Physicist Responds
Physicist: By modern standards, this is not big data, but nowadays I use oscilloscopes which can return 5 columns, each of 500,000 lines. More typical would be 2-3 columns and 16k lines. Either way, the data sets are too big to cut and paste, so I've actually learned how to read such files into Sage from a computer desktop. Also, I think you should ask [our astronomer], who uses asteroid databases with up to 500k objects, giving more than 5 parameters for each one.
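As a rough illustration of reading such a file programmatically instead of by cut and paste, here is a minimal Python sketch using only the standard library. The file layout (comma-separated, one header row, numeric columns) and the names `read_columns`, `time`, `ch1`, and `ch2` are assumptions made for illustration. Real oscilloscope exports vary by vendor, and this is not the physicist's actual Sage workflow.

```python
import csv
import io


def read_columns(handle):
    """Read a CSV stream with a header row into {column name: list of floats}."""
    reader = csv.reader(handle)
    header = next(reader)
    columns = {name: [] for name in header}
    for row in reader:
        if not row:  # skip blank lines
            continue
        for name, value in zip(header, row):
            columns[name].append(float(value))
    return columns


# Hypothetical three-row stand-in for an exported oscilloscope trace.
sample = io.StringIO("time,ch1,ch2\n0.0,0.10,0.20\n0.5,0.15,0.25\n1.0,0.30,0.10\n")
data = read_columns(sample)
print(len(data["time"]), max(data["ch1"]))
```

For a file on disk, the same function could be called on `open(path, newline="")`, where `path` is whatever file the instrument wrote. The row count then stops mattering in the way it does for copy and paste.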
Questions to My Colleagues
1. What is the largest data set that students encounter in your respective majors?
2. In the context of your disciplines, when someone says "big data," what do you think of? How big? What application areas?
Main Stories:
- There is an enormous difference in scale between classes, student research, and disciplinary biggies.
- Science faculty are only vaguely familiar with really large data sets.
- Many faculty never work with anything but very modestly sized data.
- Some angst about the approaching big data train.
Big is when your workflow breaks
Physicist: I know my workflow changed when data sets of more than 64k rows could no longer be copied and pasted into (former versions of) Excel and then edited there.
Peering Around the Bend
Is There a Light at the End of the Tunnel?
Is Big Data Primarily an HR Problem?
Or is it an IT problem?
Harnessing the Deluge
The CVC
A group of liberal arts colleges has formed a Computation and Visualization Consortium (CVC) to address issues of curriculum and faculty development.
- Faculty from Math, Stat, Bio, Chem, and CS are involved
- 2 months into a 4-year plan
- Attempting to identify key skills and ways to teach them
- Faculty development already identified as a key component
Some CVC First Steps
- At St. Olaf, Intro Programming is being taught using Python and R, focusing on data-related programming tasks.
- At Macalester, all science students will take a 1-hour DCF course.
- At Smith, a data science course is being introduced this fall.
- At Calvin, an NSF grant is funding a redesign of biology laboratories to make more substantial use of chemistry, mathematics, and data analysis, and physics classes are using Sage/Python.
- Other institutions are still in planning phases.
- Project MOSAIC is working to make using R easier earlier in the curriculum.
- Institutions with cross-departmental effort are progressing most quickly.
DCF at Macalester
Data and Computation Fundamentals: a 1-hour course for all science students, taught in 7 weekly sessions during the first year.
Example skills
- RStudio and reproducible analysis (RMarkdown)
- Data: curation; import; tidying and cleaning
- Graphical (and numerical) summaries of data
- Split/Apply/Combine
- Database light (join, merge, groupby)
- Modeling/Fitting with functions and smoothers
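To make "Split/Apply/Combine" and "Database light" concrete, here is a minimal sketch. The course itself works in RStudio. Python's standard library is used here only for illustration, and the records, field names, and site labels are all invented for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical measurement records and a small lookup table (not course data).
measurements = [
    {"sample": "A", "site_id": 1, "value": 2.0},
    {"sample": "B", "site_id": 1, "value": 4.0},
    {"sample": "C", "site_id": 2, "value": 10.0},
]
sites = {1: "north", 2: "south"}

# Join: attach the site name to each measurement via its site_id key.
joined = [dict(row, site=sites[row["site_id"]]) for row in measurements]

# Split: group the values by site.
groups = defaultdict(list)
for row in joined:
    groups[row["site"]].append(row["value"])

# Apply and combine: summarize each group, collect the results in one table.
summary = {site: mean(values) for site, values in groups.items()}
print(summary)
```

The same three steps (join on a key, group, summarize per group) are what a groupby-style tool performs in one call. Writing them out by hand is only meant to show what such a call does.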
Ordway Bird Data
DCF at Macalester, Week 2
Example data on 7000 birds (weight, time of capture, species, sex, etc.); data cleaning required.
Example tasks (all answered with plots)
1. How many total birds per month?
2. How does the weight differ by species, wing chord, and tail length? Make a scatter plot of mean weight by mean wing chord for each species, using color for tail length and point diameter for the standard deviation of weight. Make a similar scatter plot of the individual birds (leaving off the standard deviation). Compare with the previous plot.
3. Any trends over hour of the day?
4. Within species, does the mixture of sexes depend on the time of year?
5. How would you identify a migratory species?
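The summaries behind tasks 1 and 2 might be computed along these lines before any plotting. This is a sketch only: the five records below are invented placeholders, not the Ordway data, and the field names (`species`, `month`, `weight`, `wing_chord`) are assumptions. The course does this work in R.

```python
from collections import Counter, defaultdict
from statistics import mean, stdev

# Invented placeholder records; the real data set has about 7000 birds.
birds = [
    {"species": "chickadee", "month": 5, "weight": 11.0, "wing_chord": 64.0},
    {"species": "chickadee", "month": 5, "weight": 12.0, "wing_chord": 66.0},
    {"species": "chickadee", "month": 9, "weight": 13.0, "wing_chord": 65.0},
    {"species": "robin", "month": 5, "weight": 76.0, "wing_chord": 128.0},
    {"species": "robin", "month": 9, "weight": 80.0, "wing_chord": 132.0},
]

# Task 1: total birds per month.
birds_per_month = Counter(b["month"] for b in birds)

# Task 2: per-species mean weight, mean wing chord, and sd of weight,
# i.e. the quantities mapped to the y-axis, x-axis, and point diameter.
by_species = defaultdict(list)
for b in birds:
    by_species[b["species"]].append(b)

species_summary = {}
for species, rows in by_species.items():
    weights = [r["weight"] for r in rows]
    chords = [r["wing_chord"] for r in rows]
    species_summary[species] = {
        "mean_weight": mean(weights),
        "mean_wing_chord": mean(chords),
        "sd_weight": stdev(weights),  # sample sd; needs at least two birds
    }

print(dict(birds_per_month))
print(species_summary["chickadee"])
```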
Some Questions
1. What is big data? Size? Structure? Hygiene? Workflow?
2. What are the key (big) data skills? Programming? Databases? Concepts?
3. Who needs (big) data skills? Science majors? Stat majors? CS majors? Special subsets?
4. When/how will these students get these skills? Special courses/programs? Thread through all courses?
5. Who will take the lead on Big Data Education? Statisticians? Computer Scientists? Natural Scientists?
"Our ability to statistically analyze data has grown significantly with the maturing of computer hardware and software. However, the evolution of our statistics capabilities has taken place without a corresponding evolution in the curriculum for the undergraduate chemistry major. Most faculty understands the need for a statistical educational component, but there is little consensus as to the exact nature of what is to be taught and who should teach it. Because of the large number of courses required for the undergraduate chemistry major, it seems unlikely that requiring a course on statistics will be practical at most institutions. Additionally, it is unlikely that the typical high school education will address the needed statistics or the software training to prepare students for the chemistry courses. Therefore, the chemistry faculty must teach the statistics needed by the majors. The faculty needs to focus on statistics useful to the chemist and this is distinctly different than what is often encountered in biology, medicine, psychology, and business. A starting point is suggested for a discussion on a statistics curriculum that addresses the needs of the chemistry majors."
Nicholas E. Schlotter, "A Statistics Curriculum for the Undergraduate Chemistry Major," Journal of Chemical Education 2013, 90 (1), 51-55.
Some Questions
1. What is big data? Size? Structure? Hygiene?
2. What are the key (big) data skills? Programming? Databases? Concepts?
3. Who needs (big) data skills? Science majors? Stat majors? Special subsets?
4. When/how will these students get these skills? Special courses/programs? Thread through all courses? Taught in subject area or by external specialists?
5. Who will take the lead on Big Data Education? Statisticians? Computer Scientists? Natural Scientists?
6. Must we walk before we run? Can/should we get to the good/big stuff right away?
Thanks
All of this is a work in progress that would not be as far as it is, and will not get as far as it can go, without the help of others.
Co-conspirators
- Danny Kaplan (Macalester College)
- Libby Shoop (Macalester College)
- Nick Horton (Amherst College)
- The Computation and Visualization Consortium
- My science colleagues at Calvin
- The team at RStudio