# Tammy Pirmann HS CS teacher in PA NSF RET in Big Data with Temple University Teach CS Principles course

1

2 Tammy Pirmann: HS CS teacher in PA; NSF RET in Big Data with Temple University; teaches the CS Principles course. Slobodan Vucetic: Temple University; NSF research project involving Big Data education through the pipeline.

3 CS Principles: Big Idea III. Data: Data and information facilitate the creation of knowledge. CSTA K-12 Standards (5.3.A): CT 4. Compare techniques for analyzing massive data collections. CPP 11. Describe techniques for locating and collecting small and large-scale data sets. Job growth for data scientists.

4 Slobodan Vucetic teaches a great graduate-level course at Temple that uses BIG data sets. He created an undergrad course based on the successful grad course. He and I worked together so I would understand the data sets and how college students work with them. I wrote a unit for HS students based on his undergrad course.

5 In my class, this unit follows a few basic App Inventor tutorials and a lesson on abstraction. Students have varying degrees of comfort with spreadsheets and databases. Students have read the first two chapters of Blown to Bits by Abelson, Ledeen, and Lewis.

6 1. Orient: activate, motivate, prepare 2. Explore: observe, analyze 3. Form Concepts: questions 4. Apply: examples and problems 5. Close: reflect and assess

7 Wear two hats. Take on the role of student and see how the student interacts with the material. Remain an educator and think about what you can use in your situation. Break into groups of 4, making sure that at least one member has a device and the files.

8

9 In 2006 Netflix offered a \$1,000,000 prize to the team that could create a movie recommendation system that was 10% better than their existing one. In 2009 that prize went to "BellKor's Pragmatic Chaos". In this activity, we will work with a smaller (but still very large) set of movie data to explore how data can be used to generate useful information.

10 Why is a movie recommendation system worth a million dollars to Netflix?

11 There are three interrelated sets of data The movies

12 The people The ratings

13 1. What scale is being used for recommendations? How many stars?

14 2. What information are we keeping track of for each movie?

15 3. Can a person rate more than one movie?

16 4. What information are we capturing from our users? How are we capturing this data?

17 What additional movie data would be useful?

18 What might we want to know about the people doing the ratings?

19 Is it possible for a movie to never be rated? What effect does that have?

20 How would you go about determining which movie is the "best" movie?

21 Why would people rate movies?

22 Who might use this data, and how?

23 Discuss and agree on three potential problems inherent in an online rating and recommendation system. Be prepared to report out to the class. Discuss and agree on three questions you would like the answers to based on this data. Are there any additional data points needed in order to answer any of your questions? What additional data points would it have been helpful to have access to?

24 Open the people text file. How is it formatted? What type of file would you expect this type of data to be in? Why? Open the movie text file. How is it different?

25 These three files are related to each other: the people rated movies. We will make three tabs in one Excel file. Video tutorial. * I have a completed Excel workbook available on the next day for scaffolding, absences, etc.

26 What happened with the vote data? It turns out that Excel has limits. There are too many rows in the vote data to be imported into Excel. Google spreadsheets can only handle 400,000 cells!
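
The limits on this slide can be checked before an import is attempted. The following is a minimal sketch, not part of the original unit: the Excel figure is the row cap of the .xlsx worksheet format (Excel 2007 and later), the Google figure is the one quoted on the slide, and the row and column counts used at the bottom are made-up numbers for illustration only.

```python
# Worksheet limits.
# EXCEL_MAX_ROWS: row cap of an .xlsx worksheet (Excel 2007 and later).
# GOOGLE_MAX_CELLS: the figure quoted on the slide, not re-verified here.
EXCEL_MAX_ROWS = 1_048_576
GOOGLE_MAX_CELLS = 400_000


def fits_in_excel(row_count: int) -> bool:
    """True if a file with this many rows fits on one Excel worksheet."""
    return row_count <= EXCEL_MAX_ROWS


def fits_in_google(row_count: int, column_count: int) -> bool:
    """True if rows x columns stays within the quoted cell limit."""
    return row_count * column_count <= GOOGLE_MAX_CELLS


# Hypothetical sizes, NOT the real size of the vote file.
hypothetical_rows = 2_000_000
hypothetical_columns = 4

print(fits_in_excel(hypothetical_rows))
print(fits_in_google(hypothetical_rows, hypothetical_columns))
```

Students can count the lines in each text file first and then predict which tool will refuse the import.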

27 One thing you should have noticed is that the data does not have any labels. We need to create field labels for this data. Let's start with the people tab: What do you think are good labels for the columns of data? The movie tab presents a significant problem. We have a file called a read-me file that tells us what each column is.
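
Attaching labels to unlabeled columns can also be shown outside the spreadsheet. This is a rough sketch only: the label names, the tab delimiter, and the two sample rows are assumptions made for illustration, and the read-me file remains the authority on what each column really is.

```python
import csv
import io

# Hypothetical labels chosen by the class; the real meaning of each
# column comes from the read-me file.
PEOPLE_LABELS = ["person_id", "age", "gender", "postal_code"]

# Made-up rows standing in for the unlabeled people file
# (the tab delimiter is an assumption).
sample_text = "1\t34\tM\t19122\n2\t7\tF\t97201\n"


def read_labeled(text, labels, delimiter="\t"):
    """Parse unlabeled delimited text and attach a label to every column."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=labels,
                            delimiter=delimiter)
    return list(reader)


people = read_labeled(sample_text, PEOPLE_LABELS)
for person in people:
    print(person["person_id"], person["age"])
```

Every value is read as text here, so a later step would need to convert `age` to a number before sorting on it.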

28 I have a question: can we trust this data? Can I use it to say "The data shows that males between 12 and 24 prefer action movies over romance movies"? Do I have confidence in the demographic data? Use the sort function to sort the people data on age. What do you notice?

29 We break into small groups based on previous experience with Excel. I teach sort, filter, the count function, and renaming tabs. Students then use these to determine the percent of people who have probably lied on the form: liars / all people.
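
The liars / all people calculation can be written out as a short sketch. The age cutoffs below are an assumption (each class decides what counts as a believable age), and the list of ages is made up for illustration.

```python
def percent_suspect(ages, min_age=5, max_age=100):
    """Percent of entries whose age falls outside the range we believe.

    min_age and max_age are classroom assumptions, not part of the data.
    """
    if not ages:
        return 0.0
    suspect = sum(1 for age in ages if not (min_age <= age <= max_age))
    return 100.0 * suspect / len(ages)


# Made-up ages for illustration only.
sample_ages = [34, 7, 0, 120, 18, 45, 2, 61]

# Sorting first mirrors the Excel step: implausible values collect at the ends.
print(sorted(sample_ages))
print(percent_suspect(sample_ages))  # 3 of 8 entries are outside 5..100 -> 37.5
```

Changing the cutoffs and rerunning shows students how much the answer depends on the definition of a "liar".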

30 The original groups of students choose a question they wrote down on the first day. They now determine how to go about getting the answer to that question from the data. This is an analysis plan, not the actual analysis (since some of them have questions that may need a more powerful tool).

31 What genre of movies do people like me give the highest ratings to? We need to determine "people like me" from the people data. We then need to find all the ratings provided by them. We need to put those ratings into genre buckets.
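
The three steps of this plan can be sketched in a few lines. Everything in the sample data is invented for illustration: the field names, the genres, the 1 to 5 scores, and the definition of "people like me" are all assumptions, and averaging the scores is just one possible way to compare the buckets.

```python
from collections import defaultdict

# Made-up miniature versions of the three data sets.
people = [
    {"person_id": 1, "age": 16, "gender": "F"},
    {"person_id": 2, "age": 17, "gender": "F"},
    {"person_id": 3, "age": 45, "gender": "M"},
]
movie_genre = {10: "Action", 11: "Romance", 12: "Action"}
votes = [(1, 10, 4), (1, 11, 2), (2, 12, 5), (2, 11, 3), (3, 10, 1)]


def average_rating_by_genre(people, movie_genre, votes, is_like_me):
    # Step 1: determine "people like me" from the people data.
    like_me = {p["person_id"] for p in people if is_like_me(p)}
    # Steps 2 and 3: find their ratings and drop each into a genre bucket.
    buckets = defaultdict(list)
    for person_id, movie_id, score in votes:
        if person_id in like_me:
            buckets[movie_genre[movie_id]].append(score)
    return {genre: sum(s) / len(s) for genre, s in buckets.items()}


def teen_girl(person):
    """One hypothetical definition of 'people like me'."""
    return person["gender"] == "F" and 15 <= person["age"] <= 18


result = average_rating_by_genre(people, movie_genre, votes, teen_girl)
best_genre = max(result, key=result.get)
print(result, best_genre)
```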

33 Spreadsheets gave us more tools than the text file. Databases give us more tools than the spreadsheet. We have a database on our computers as part of Microsoft Office. Open Access.

34 We will import our original txt files into Access. Each file will become a table in the database. The people file has an id for each person, which will be defined as our primary key. The movie file has an id for each movie, which will be defined as our primary key. The ratings file has the people id and the movie id, but no ratings id.

35 Each record in the database needs to be able to be identified. The primary key is how we identify each record. Since each movie can only be rated by a person once, the combination of person id and movie id can be the primary key for our vote table.
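
The composite key idea can be demonstrated with a small sketch. It uses SQLite through Python's built-in `sqlite3` module rather than Access, and the table and column names are assumptions chosen to match the slide's wording.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical vote table: no separate vote id, so the pair
# (person_id, movie_id) identifies each record.
conn.execute("""
    CREATE TABLE vote (
        person_id INTEGER NOT NULL,
        movie_id  INTEGER NOT NULL,
        score     INTEGER,
        PRIMARY KEY (person_id, movie_id)
    )
""")

conn.execute("INSERT INTO vote VALUES (1, 10, 4)")
conn.execute("INSERT INTO vote VALUES (1, 11, 2)")  # same person, other movie: allowed

try:
    # Same person rating the same movie again breaks the primary key.
    conn.execute("INSERT INTO vote VALUES (1, 10, 5)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM vote").fetchone()[0]
print(duplicate_rejected, row_count)
```

The database itself now enforces the rule that a person rates a given movie only once.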

36 A relational database is one where the tables of data are related to each other by the primary keys. Our tables are related through the vote table. The primary key of the people table is present in the vote table. The primary key of the movie table is present in the vote table.
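
How the vote table ties the other two together can be sketched with a join. This again uses SQLite via Python's `sqlite3` module as a stand-in for Access; the column names, titles, and scores are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (person_id INTEGER PRIMARY KEY, age INTEGER);
    CREATE TABLE movie  (movie_id  INTEGER PRIMARY KEY, title TEXT);
    -- The vote table carries the primary keys of both other tables.
    CREATE TABLE vote (
        person_id INTEGER REFERENCES people(person_id),
        movie_id  INTEGER REFERENCES movie(movie_id),
        score     INTEGER,
        PRIMARY KEY (person_id, movie_id)
    );
    INSERT INTO people VALUES (1, 16), (2, 45);
    INSERT INTO movie  VALUES (10, 'Movie A'), (11, 'Movie B');
    INSERT INTO vote   VALUES (1, 10, 4), (2, 10, 2), (2, 11, 5);
""")

# Follow each vote back to the person who cast it and the movie it is about.
rows = conn.execute("""
    SELECT people.age, movie.title, vote.score
    FROM vote
    JOIN people ON people.person_id = vote.person_id
    JOIN movie  ON movie.movie_id   = vote.movie_id
    ORDER BY vote.person_id, vote.movie_id
""").fetchall()

for row in rows:
    print(row)
```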

37 The simplest query to write is one based on one table. We will use a query to recreate a sort and filter we had done in Excel. Using the people table, let's look only at the people who entered an age we consider valid. Sort these records by age. We can hide the postal code if we are not using it.
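
The single-table query described here might look like the sketch below when written as SQL. It runs against SQLite through Python's `sqlite3` module instead of the Access query designer, and the column names, sample rows, and age cutoffs are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE people (
        person_id   INTEGER PRIMARY KEY,
        age         INTEGER,
        gender      TEXT,
        postal_code TEXT
    )
""")
# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [(1, 34, "M", "19122"), (2, 0, "F", "97201"),
     (3, 16, "F", "19004"), (4, 120, "M", "00000")],
)

# What counts as a valid age is a classroom decision, assumed here.
MIN_AGE, MAX_AGE = 5, 100

# Filter on age, sort by age, and leave postal_code out of the SELECT
# list, which is the SQL equivalent of hiding the column.
rows = conn.execute(
    "SELECT person_id, age, gender FROM people "
    "WHERE age BETWEEN ? AND ? ORDER BY age",
    (MIN_AGE, MAX_AGE),
).fetchall()
print(rows)
```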

38 Go back to your written analysis plan. Write the query iteratively. Start with one table and get that query working. Add more complexity to your query in small chunks, checking for accuracy at each step.

39 After using the Movie data to teach spreadsheets and databases, we change data sets lest the students believe that big data and recommendation systems are synonymous. The Portland data is even larger than the movie data and represents the movements of the people of Portland, Oregon over a 24-hour period.

40 Locations - The city is divided into a grid with each square given a numeric representation. Demographics - Each person has an id and demographic data associated with them. Activities - Each type of activity is given a numeric representation. Time - measured in seconds past midnight.
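
Times stored as seconds past midnight are hard to read at a glance, so a small conversion helper is useful. This is a minimal sketch; the function names are hypothetical, and it assumes every value falls within a single 24-hour day.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def seconds_to_clock(seconds: int) -> str:
    """Convert seconds past midnight to an HH:MM:SS string."""
    # Assumption: all values fall inside one 24-hour day.
    if not 0 <= seconds < SECONDS_PER_DAY:
        raise ValueError("expected a value within one 24-hour day")
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def hour_of_day(seconds: int) -> int:
    """Hour bucket (0-23), handy for grouping activity by hour."""
    return seconds // 3600


print(seconds_to_clock(27000))  # 07:30:00
print(hour_of_day(27000))       # 7
```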

41 With this information, what can we learn?

42 Is it possible there are questions that we should not ask?

43 This data could be used by an urban planner to determine if the city needs a large venue in a particular part of the city. It could also be used to determine if a major highway needs more capacity. It could be used to predict where utilities are most needed by hour.

44 This type of data could show drivers which roads have the most traffic on them. Data can show us how much time people spend on their commute. Companies can use this type of data to determine where to open a franchise.

45 Each group of students brainstorms and gives the teacher three proto-concepts for deeper analysis of the Portland data. The teacher returns the concepts with one chosen for the group (to eliminate duplication). Students work together to develop that concept and find the answers in the data. Students report out to the class what they did and the results.

46 Two options allow you to scaffold the project to different ability levels. Both options have the same format: Proposal; Data acquisition; Data analysis plan; Final report.

47 Proposal: Includes the community, the question you want to answer, why it should be answered, what data will be collected, and how the answer will be provided to the community. Data collection plan and form. Data analysis plan. Final report.

48 Proposal. Data collection plan and form: Create a form in Google Docs, disseminate the form via , forums, link on website, etc. Data analysis plan. Final report.

49 Proposal. Data collection plan and form. Data analysis plan: You have several tools at your disposal to analyze your data. Decide which tools you will use and why. Develop the queries, sorts and filters that you will use when the data is collected. Be sure your data analysis plan covers the main questions that originally prompted you to collect this data. Final report.

50 Proposal. Data collection plan and form. Data analysis plan. Final report: Produce a written report back to the community to share the information discovered by your analysis of the data provided by the community. This report should use illustrations or charts where appropriate.

51 The only difference for option 2 is that the student will find/access existing data. There are many large data sets available from the government. Some organizations may also have raw data for the student to work with (Scouts, church groups, etc.).

52 The Portland data could be used with Processing to create animated graphs of people movement. What's your idea?

