Exercises for Data Visualisation for Analysis in Scholarly Research

Exercise 1: exploring network visualisations

1. In your browser, go to http://bit.ly/11qqxuj
2. Scroll down the page to the network graph.
3. Take a few minutes to explore the visualisation: try holding the cursor over items, clicking, dragging, etc.
4. You can see the same data represented in other graphs below.
5. Discuss with your neighbour: does interacting with the network graph give you more or less information than the other visualisations? Does it open up new questions?

Time: approx 5 minutes

Exercise 2: comparing N-gram tools

NB: in both tools, copyright affects the availability of 20th-century books. Transcription errors may also affect results, particularly for older books (e.g. the 'long s' being read as 'f').

1. Think of two words or phrases you'd like to compare over time (e.g. Burma, Burmah).
2. Open two browser windows.
3. In one, go to http://books.google.com/ngrams
4. In the other, go to http://bookworm.culturomics.org
5. Enter your words or phrases in each and compare the results.
6. Discuss with your neighbour: what differences did you find, and what might have caused them?

Google Ngram tips: http://books.google.com/ngrams/info
Background information on Bookworm: http://bookworm.culturomics.org/ol.html
Bookworm tips: click the 'cog' icon next to the 'i' to change the time period, or click the underlined words next to the search term to change which books are searched (e.g. subject, language, country, gender of author). [Image on slide]

Time: approx 5 minutes
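Both n-gram tools plot a term's relative frequency: its occurrences in each year's books divided by the total number of words published that year. The sketch below illustrates that calculation on an invented toy corpus (the years, texts and tokenisation here are made-up examples, not Google's or Bookworm's data or methods):

```python
from collections import Counter

def relative_frequency(corpus_by_year, term):
    """For each year, count occurrences of `term` divided by total tokens,
    mirroring what n-gram viewers plot (toy data, not a real book corpus)."""
    result = {}
    for year, texts in corpus_by_year.items():
        tokens = [w.lower() for text in texts for w in text.split()]
        counts = Counter(tokens)
        result[year] = counts[term.lower()] / len(tokens) if tokens else 0.0
    return result

# Toy corpus: a few 'books' per year
corpus = {
    1890: ["travels in Burmah and beyond", "the rivers of Burmah"],
    1950: ["a history of Burma", "Burma and its neighbours in Asia"],
}
print(relative_frequency(corpus, "Burmah"))  # higher in 1890
print(relative_frequency(corpus, "Burma"))   # higher in 1950
```

Differences between the two tools' results often come down to exactly these choices: which books are in each corpus, and how the text was tokenised and transcribed.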
Exercise 3: scholarly visualisations

You will be asked to pair with your neighbour and explore and discuss one of the following visualisations:

University of Richmond, Visualizing Emancipation
http://www.americanpast.org/emancipation/
Further information: http://dsl.richmond.edu/emancipation/about/ and http://dirt.terrypbrock.com/2012/04/visualizing-emancipation-examining-its-process-through-digital-tools/

Stanford, "Mapping the Republic of Letters"
http://www.stanford.edu/group/toolingup/rplviz/rplviz.swf
Further information: http://openglam.org/2012/03/21/mapping-the-republic-of-letters/, http://danbri.org/words/2010/11/22/603, http://republicofletters.stanford.edu/tools/

GAPVis
http://gap.alexandriaarchive.org/gapvis/index.html#index
Further information: http://googleancientplaces.wordpress.com/

Digital Harlem: Everyday Life 1915-1930
http://www.acl.arts.usyd.edu.au/harlem/
Further information: http://digitalharlemblog.wordpress.com/, http://writinghistory.trincoll.edu/evidence/robertson-2012-spring/

Instructions

1. In your browser, go to the suggested site.
2. Take a few minutes to explore the visualisation.
3. Discuss with your neighbour: what do you think is being presented here? What stories or trends can you start to see? Does it work better at one scale than another? Do you find it more effective at the aggregate or the detail level? Does it present an argument, or provide a space to develop and explore one? If it was designed to present an argument or investigate a particular question, what do you think that was? What have you learned from the visualisation that you might not have learned from looking at the data or reading text about it?
4. Report back to the group: summarise the site's purpose, visualisation formats and data types in a sentence, then share the most interesting parts of your discussion.

Time: approx 10 minutes
Exercise 4: cleaning data in Refine

In this exercise, we will clean an existing dataset to prepare it for visualisation as a chart and map in the next exercise. We want to investigate different disciplines within this data later, so we will review the data, remove rows that are missing certain values (in real life we would try to do some research to fill in the gaps before resorting to this!), and make the remaining values more consistent. Don't worry if you make a mistake: you can click 'Undo/Redo' and go back as many steps in your history as you need.

1. Start the Refine application. (If there isn't a shortcut for it on the computer desktop, go to the Start menu and search for Refine.) Double-click the .exe file (on Windows) to start the application. It will open a page in your browser. Tip: if it opens in Internet Explorer, copy the page address from the URL/location bar and paste it into Chrome or Firefox.
2. Create a new project. [See slide 'Exercise 4: data cleaning in Refine']
   a. On the left-hand side of the browser page, click on 'Create project'. Under 'Get data from', select 'This Computer'. Browse to the file Inspiring Women through History_April2013.xlsx, load it and select 'Next'.
   b. If the preview rows look OK, click 'Create Project'.
   c. Spend some time looking over the data. You will notice that not every field has been filled in for each row, and that the level of detail varies within a column (e.g. locations).
3. Start 'cleaning' the data: remove rows where 'Discipline' has been left blank.
   a. Click the down arrow in the Discipline column; select Facet; Text Facet. [See slide]
   b. Refine should open a new block on the left-hand side of the screen, listing all the different disciplines in the dataset.
   c. Scroll to the bottom of the box and click on '(blank)'.
   d. This should change the rows on the right-hand side so that only rows without a discipline are listed.
   e. To remove these rows from the dataset, click the down arrow next to 'All' (at the start of the row); select Edit rows; Remove all matching rows.
   f. This should remove 23 rows.
4. Let Refine find close matches in column values.
   a. Click 'reset' in the blue bar of the left-hand Discipline box (this resets the previously selected rows on the right-hand side).
   b. Click 'Cluster' in the left-hand Discipline box.
   c. A new block should pop up, showing 'Painter' and 'Painter.', which Refine has detected as being close matches.
   d. Tick the 'merge' box and then click 'Merge selected and close'.
   e. Congratulations! You have merged some messy records!
5. Manually review and simplify the number of disciplines listed. We will reduce the number of disciplines in the dataset by merging them, so that they are easier to display in a simple visualisation.
   a. Click on 'count' in the line 'Sort by name count'.
   b. The most common disciplines should now be at the top.
   c. Review the disciplines. There is one row each for Biologist, Botanist, Chemist, Natural Historian and Inventor. We will merge each of them with 'Scientist'.
   d. Hover your cursor over 'Inventor' so that 'edit' appears as an option. Click 'edit'.
   e. In the box that appears, type 'Scientist' and click 'Apply'.
   f. Repeat for the rows for Biologist, Botanist, Chemist and Natural Historian.
   g. There should now be 13 rows for Scientist.
   h. Edit any other disciplines as you wish. You can review the values by clicking on the discipline name and scrolling through the rows to work out which values to simplify. For example, Poet might become Writer, and Surgeon and Doctor might both become Medicine.
6. When you're done tidying the values for Discipline, click the 'x' next to Discipline in the left-hand box to 'remove this facet'.
7. Repeat the steps above to remove rows where 'Birth_place' is missing.
8. If you wish, repeat step 4 (Let Refine find close values) on the Birth_place field to tidy up uncertain values like 'London?'.
9. Optional: remove rows with missing numeric values. Because we need a year to build a timeline, we'll remove rows without a currently known birth year.
   a. Scroll sideways through the columns until 'Birth_year' is on the screen.
   b. Click the down arrow next to 'Birth_year'; select Facet; Numeric facet.
   c. A 'Birth_year' block should appear on the left-hand side.
   d. Untick the 'numeric' box.
   e. Only rows without a number in 'Birth_year' should be shown now.
   f. Scroll back to the left until 'All' is showing. To remove these rows from the dataset, click the down arrow next to 'All' (at the start of the row); select Edit rows; Remove all matching rows.
   g. Close the 'Birth_year' box on the left-hand side by clicking the 'x'.
10. To export your data (ready for use in another application), click on 'Export' on the right-hand side, select an option like Excel or comma-separated values, and save the file.

Congratulations, you have cleaned data with Refine! We now have data we can use to make a nice timeline. You can use the same principles to clean or reduce the complexity of huge datasets. You can find out more at http://freeyourmetadata.org/cleanup/ or https://github.com/openrefine/openrefine/wiki/documentation-For-Users

Time: approx 20 minutes
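The facet-and-remove and cluster-and-merge steps above can be sketched in a few lines of Python. This is a toy illustration, not how Refine is implemented: the `fingerprint` function is only a rough approximation of Refine's key-collision clustering, and the rows are invented stand-ins for the workshop spreadsheet:

```python
import string

def fingerprint(value):
    """Roughly approximate Refine's key-collision clustering key:
    lowercase, strip punctuation, collapse whitespace."""
    cleaned = value.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(cleaned.split())

def clean(rows):
    """Drop rows with a blank Discipline, then merge values that share
    a fingerprint (e.g. 'Painter' and 'Painter.')."""
    rows = [r for r in rows if r.get("Discipline", "").strip()]
    canonical = {}
    for r in rows:
        key = fingerprint(r["Discipline"])
        canonical.setdefault(key, r["Discipline"])  # first spelling wins
        r["Discipline"] = canonical[key]
    return rows

# Invented rows standing in for the workshop spreadsheet
rows = [
    {"Name": "A", "Discipline": "Painter"},
    {"Name": "B", "Discipline": "Painter."},
    {"Name": "C", "Discipline": ""},
    {"Name": "D", "Discipline": "Chemist"},
]
cleaned = clean(rows)
print([r["Discipline"] for r in cleaned])  # ['Painter', 'Painter', 'Chemist']
```

The value of a tool like Refine is that it applies exactly this kind of logic across thousands of rows while letting you review each proposed merge before committing to it.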
Exercise 5: trying entity recognition

1. In your browser, go to http://nlp.stanford.edu:8080/corenlp/process
2. Find a short paragraph of text (e.g. from a news site or digitised text) to paste into the box.
3. How many of the things you recognise (concepts, people, places, events, references to time or dates, etc.) did it pick up? Is any of the other information presented useful?

Time: approx 5 minutes

Exercise 6: create a pie chart using Google Fusion Tables

1. Go to https://drive.google.com/ and log into Google (if you aren't already).
2. Go to http://bit.ly/xw0znj (or https://www.google.com/fusiontables/datasource?dsrcid=implicit&redirectpath=data&usp=apps_start&hl=en) to access Fusion from your account.
3. You should see a screen with 'Import new table' and 'From this computer'.
4. Click 'Choose file', find the file you exported from Refine on your computer and upload it.
5. Leave the default options and click 'Next' until the screen asks you to choose a table name, 'attribute data to', enter a description, etc. Fill in the values as appropriate and click 'Finish'.
6. Check that you are in 'Classic' rather than 'New' mode in Google Drive. If the menu items along the page say 'File', 'View', 'Edit', 'Visualize', 'Merge', 'Labs', you are in Classic mode. If you are in New mode, click on the 'Help' menu and select 'Back to Classic look'.
7. From the Visualize menu, choose 'Pie'. [See slide 'Pre-aggregation view of a pie chart']
8. Click 'options' then 'Aggregate'.
9. In the 'Aggregated by' section, select 'Discipline', and click 'Apply'. [Slide: 'Change options, aggregate by Discipline']
10. Set the Entity to Discipline and the value to Count.
11. You should have a pie chart of your data!

Time: approx 5 minutes
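Aggregating by Discipline with the value set to Count simply tallies how many rows share each discipline; each pie slice is one of those tallies. A minimal sketch of that aggregation, using invented discipline values rather than the workshop data:

```python
from collections import Counter

# Each row in the cleaned spreadsheet contributes one count to its
# discipline's slice; Fusion's 'Count' value is just this tally.
disciplines = ["Scientist", "Writer", "Scientist", "Artist", "Scientist"]
slices = Counter(disciplines)

for discipline, count in slices.most_common():
    print(discipline, count)
```

This is why the cleaning in Exercise 4 matters: if 'Painter' and 'Painter.' survive as separate values, they become separate slices.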
Exercise 7: geocoding data and creating a map using Google Fusion Tables

Google Fusion Tables can geocode data directly from the table, but it sometimes needs some help. Fusion will have recognised some columns as containing location data, but it will not know much about those locations. The best way to see how it copes with a dataset is to geocode it and look over the resulting map to check that records have ended up in the right place.

Before you begin, remove any aggregated fields or filters set in the last exercise: click 'options', then 'Clear filter' in the Filter tab view and 'Clear aggregation' in the 'Aggregate' tab view. Change from 'Classic' into 'New' mode by clicking 'Switch to new look' in the top right-hand corner of the browser window. If you can't see that option, let me know!

Geocode your data

1. From the Visualize menu, choose 'Table'.
2. From the File menu, choose 'Geocode'.
3. Change the value for 'Location column' to 'Combined_birthplace_location' and click 'Begin geocoding'. (The Combined_birthplace_location field hopefully contains enough information to stop London, England, being confused with London, Ontario.)
4. Wait a bit...
5. Congratulations, you have geocoded some data!

Create a map from your geocoded data

6. Click the '+' at the end of the row that starts with the File option.
7. Click the down arrow next to the new map's name and change the value for 'Select location' to the field Combined_birthplace_location.
8. Congratulations! You've created a map! If you have extra time or curiosity, try changing the options under 'Change info window layout...' and 'Change map styles...'.

Time: approx 10 minutes
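A combined location field like Combined_birthplace_location is just the location parts joined into a single string, so the geocoder has enough context to pick the right 'London'. A minimal sketch of building such a field, assuming hypothetical place and country columns (the workshop file already ships with this field, so this is only to show the idea):

```python
def combined_location(place, country):
    """Join place and country into one geocoder-friendly string,
    skipping any blank parts. Column names here are illustrative."""
    parts = [p.strip() for p in (place, country) if p and p.strip()]
    return ", ".join(parts)

print(combined_location("London", "England"))          # 'London, England'
print(combined_location("London", "Ontario, Canada"))  # 'London, Ontario, Canada'
print(combined_location("Paris", ""))                  # 'Paris'
```

Even with a combined field, it is worth scanning the finished map for points that have landed on the wrong continent: geocoders guess when the string is ambiguous.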
Exercise 8: Choose your own adventure

Options:
- explore and analyse more visualisations
- try making different visualisations with the current dataset
- try creating visualisations with your own data

Exploring and analysing more visualisations

There are links to visualisation blogs and other specialist sites on the Resources post at http://bit.ly/ujwgez (i.e. http://www.miaridge.com/resources-for-data-visualisation-for-analysis-in-scholarly-research/).

For each visualisation, you might like to consider: what sources have they used? Do they explain how they've prepared them? What effect have their choices of visualisation formats and tools had? What data or queries are prioritised, and which are more difficult or impossible? If you have a particular type of data, process, format or audience in mind, ask for suggestions for sites to look at!

Making more visualisations

ManyEyes is a useful tool for learning, but as it uses Java it can be tricky in classroom situations. You can create visualisations from data other people have uploaded, without signing up, at ManyEyes: http://www-958.ibm.com/software/analytics/manyeyes/datasets

You can also try visualising the data from the British Library Pin-a-tale project, available in Google Docs at http://bit.ly/wt1ai5

Try a visualisation and evaluate the results. Is more cleaning or transformation needed? You may need to iterate with different versions of your data after cleaning or enhancing it.

If you have your own dataset, review the 'planning' slides. What do you want to learn or express about your data? The ManyEyes site provides some guidance on the best visualisation types for different types of data at http://www-958.ibm.com/software/analytics/manyeyes/page/Visualization_Options.html, which can also be useful for planning visualisations in Google Fusion.
More data cleaning and linking (reconciling) to other data

You may want to go back to Refine and do more cleaning, or try reconciling the data to pull in more related information.
For example, you could find rows with more than one Spouse or 'Place lived' listed, and separate the values into different columns. You could experiment with different levels of location values: does it matter whether England or United Kingdom is listed?

You can find out more about cleaning inconsistent values at https://github.com/openrefine/openrefine/wiki/clustering and https://github.com/openrefine/openrefine/wiki/clustering-In-Depth

There are some instructions for reconciling data at http://freeyourmetadata.org/reconciliation/
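Separating a multi-value cell into columns amounts to splitting on a separator and padding to a fixed width. A small sketch of the idea, assuming the values are semicolon-separated (the actual separator and number of columns in the dataset may differ):

```python
def split_multi(value, sep=";", max_cols=3):
    """Split a multi-value cell (e.g. 'John Smith; Peter Jones') into a
    fixed number of columns, padding with '' where values are missing.
    This mirrors the idea behind Refine's 'Split into several columns'."""
    parts = [p.strip() for p in value.split(sep) if p.strip()]
    return (parts + [""] * max_cols)[:max_cols]

print(split_multi("John Smith; Peter Jones"))  # ['John Smith', 'Peter Jones', '']
print(split_multi("Alone"))                    # ['Alone', '', '']
```

One column per value makes it much easier for charting and mapping tools, which generally expect one value per cell.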